[ceph-users] osd cpu usage is bigger than 100%

2014-09-11 Thread yuelongguang
hi,all
i am testing rbd performance. right now there is only one vm using rbd as
its disk, and inside it fio is doing r/w.
the big difference is that i set a large iodepth instead of iodepth=1.
according to my test, the bigger the iodepth, the higher the cpu usage.
 
analysing the output of the top command:
1.
12% wa: does that mean the disk speed is not fast enough?
 
2. from where can we tell whether ceph's number of threads is enough or
not?
 
 
what do you think about it, which part is using up the cpu? i want to find the root
cause of why a big iodepth leads to high cpu usage.
 
 
---default options---
  "osd_op_threads": "2",
  "osd_disk_threads": "1",
  "osd_recovery_threads": "1",
  "filestore_op_threads": "2",
 
 
thanks
 
--top---iodepth=16-
top - 15:27:34 up 2 days,  6:03,  2 users,  load average: 0.49, 0.56, 0.62
Tasks:  97 total,   1 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s): 19.0%us,  8.1%sy,  0.0%ni, 59.3%id, 12.1%wa,  0.0%hi,  0.8%si,  0.7%st
Mem:   1922540k total,  1853180k used,    69360k free,     7012k buffers
Swap:  1048568k total,    76796k used,   971772k free,  1034272k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2763 root      20   0 1112m 386m 5028 S 60.8 20.6 200:43.47 ceph-osd
 
--top--
top - 19:50:08 up 1 day, 10:26,  2 users,  load average: 1.55, 0.97, 0.81
Tasks:  97 total,   1 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s): 37.6%us, 14.2%sy,  0.0%ni, 37.0%id,  9.4%wa,  0.0%hi,  1.3%si,  0.5%st
Mem:   1922540k total,  1820196k used,   102344k free,    23100k buffers
Swap:  1048568k total,    91724k used,   956844k free,  1052292k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 4312 root      20   0 1100m 337m 5192 S 107.3 18.0  88:33.27 ceph-osd
 1704 root      20   0  514m 272m 3648 S   0.7 14.5   3:27.19 ceph-mon

 

--iostat--

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
vdd               5.50   137.50  247.00  782.00  2896.00   8773.00    11.34     7.08    3.55   0.63  65.05
vdd               9.50   119.00  327.50  458.50  3940.00   4733.50    11.03    12.03   19.66   0.70  55.40
vdd              15.50    10.50  324.00  559.50  3784.00   3398.00     8.13     1.98    2.22   0.81  71.25
vdd               4.50   253.50  273.50  803.00  3056.00  12155.00    14.13     4.70    4.32   0.55  59.55
vdd              10.00     6.00  294.00  488.00  3200.00   2933.50     7.84     1.10    1.49   0.70  54.85
vdd              10.00    14.00  333.00  645.00  3780.00   3846.00     7.80     2.13    2.15   0.90  87.55
vdd              11.00   240.50  259.00  579.00  3144.00  10035.50    15.73     8.51   10.18   0.84  70.20
vdd              10.50    17.00  318.50  707.00  3876.00   4084.50     7.76     1.32    1.30   0.61  62.65
vdd               4.50   208.00  233.50  918.00  2648.00  19214.50    18.99     5.43    4.71   0.55  63.20
vdd               7.00     1.50  306.00  212.00  3376.00   2176.50    10.72     1.03    1.83   0.96  49.70




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why one osd-op from client can get two osd-op-reply?

2014-09-11 Thread yuelongguang
as for the second question, could you tell me where the code is?

how does ceph make size/min_size copies?
 
thanks









At 2014-09-11 12:19:18, "Gregory Farnum"  wrote:
>On Wed, Sep 10, 2014 at 8:29 PM, yuelongguang  wrote:
>>
>>
>>
>>
>> as for ack and ondisk, ceph has size and min_size to decide how many
>> replicas there are.
>> if the client receives ack or ondisk, does that mean at least min_size
>> osds have completed the ops?
>>
>> i am reading the source code, could you help me with two questions.
>>
>> 1.
>>  on the osd, where is the code that replies to ops separately for ack and
>> ondisk?
>>  i checked the code, but i thought they are always replied together.
>
>It depends on what journaling mode you're in, but generally they're
>triggered separately (unless it goes on disk first, in which case it
>will skip the ack — this is the mode it uses for non-btrfs
>filesystems). The places where it actually replies are pretty clear
>about doing one or the other, though...
>
>>
>> 2.
>>  so far i only know how the client writes ops to the primary osd; inside the osd cluster,
>> how does it guarantee that min_size copies are reached?
>> i mean, when the primary osd receives ops, how does it spread them to the others, and
>> how does it process their replies?
>
>That's not how it works. The primary for a PG will not go "active"
>with it until it has at least min_size copies that it knows about.
>Once the OSD is doing any processing of the PG, it requires all
>participating members to respond before it sends any messages back to
>the client.
>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>>
>>
>> greg, thanks very much
>>
>>
>>
>>
>>
>> On 2014-09-11 01:36:39, "Gregory Farnum" wrote:
>>
>> The important bit there is actually near the end of the message output line,
>> where the first says "ack" and the second says "ondisk".
>>
>> I assume you're using btrfs; the ack is returned after the write is applied
>> in-memory and readable by clients. The ondisk (commit) message is returned
>> after it's durable to the journal or the backing filesystem.
>> -Greg
>>
>> On Wednesday, September 10, 2014, yuelongguang  wrote:
>>>
>>> hi, all
>>> i was recently debugging ceph rbd, and the log shows that one write to an osd can get
>>> two replies.
>>> the difference between them is the seq.
>>> why?
>>>
>>> thanks
>>> ---log-
>>> reader got message 6 0x7f58900010a0 osd_op_reply(15
>>> rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304
>>> write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6
>>> 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue
>>> 0x7f58900010a0 prio 127
>>> 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>>> c=0xfae940).reader reading tag...
>>> 2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>>> c=0xfae940).reader got MSG
>>> 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>>> c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0
>>> 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
>>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>>> c=0xfae940).reader wants 247 from dispatch throttler 247/104857600
>>> 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>>> c=0xfae940).reader got front 247
>>> 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
>>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>>> c=0xfae940).aborted = 0
>>> 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>>> c=0xfae940).reader got 247 + 0 + 0 byte message
>>> 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq #
>>> = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0
>>> 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
>>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>>> c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15
>>> rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304
>>> write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6
>>>
>>>
>>
>>
>> --
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about librbd io

2014-09-11 Thread yuelongguang
hi, josh durgin:
 
please look at my test: inside the vm i use fio to test rbd performance.
fio parameters: direct io, bs=4k, iodepth >> 4
from the information below, it does not match.
avgrq-sz is not approximately 8,
and avgqu-sz is small and erratic, less than 32. why?
which part of ceph might gather/scatter io requests, and why is avgqu-sz so
small?
 
let's work it out.  haha
 
thanks
 
iostat-iodepth=32-- blocksize=4k--
Linux 2.6.32-358.el6.x86_64 (cephosd4-mdsa)     09/11/2014      _x86_64_        (2 CPU)

Device:         rrqm/s   wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
vdd               0.12     5.81    8.19    35.39   132.09    670.65    18.42     0.31    7.06   0.55   2.41
vdd               0.00   291.50    0.00  1151.00     0.00  13091.50    11.37     5.06    4.40   0.23  26.35
vdd               0.00   208.50    0.00  1020.00     0.00   8294.50     8.13     2.52    2.47   0.39  39.30
vdd               0.00    36.00    0.00  1076.00     0.00  17560.00    16.32     0.60    0.56   0.30  32.30
vdd               0.00   242.50    0.00  1143.00     0.00  22402.00    19.60     3.78    3.31   0.25  28.90
vdd               0.00    31.00    0.00   906.50     0.00   5351.50     5.90     0.37    0.40   0.28  25.70
vdd               0.00   294.50    0.00  1148.50     0.00  16620.50    14.47     4.49    3.91   0.21  24.60
vdd               0.00    26.50    0.00   810.50     0.00   4922.50     6.07     0.37    0.45   0.35  28.35
vdd               0.00    45.50    0.00  1022.00     0.00   6117.00     5.99     0.38    0.37   0.28  28.15
vdd               0.00   300.00    0.00  1155.00     0.00  16997.50    14.72     3.58    3.10   0.21  24.30
vdd               0.00    27.00    0.00   962.50     0.00   6846.50     7.11     0.44    0.46   0.35  33.60
vdd               0.00   270.00    0.00  1249.50     0.00  14400.00    11.52     4.61    3.69   0.25  31.25
vdd               0.00    15.00    3.00   660.00    24.00   4247.00     6.44     0.38    0.57   0.45  29.60
vdd               0.00    17.00   24.50   592.50   196.00   8039.00    13.35     0.58    0.94   0.83  51.05
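For reference when reading these numbers: rsec/s, wsec/s and avgrq-sz are in 512-byte sectors, so a pure 4k workload would show an avgrq-sz of about 8. A quick sanity check against the second data row above, using only the values already printed (plain awk arithmetic):

# avgrq-sz = (rsec/s + wsec/s) / (r/s + w/s), reported in 512-byte sectors
awk 'BEGIN { printf "%.2f sectors (~%.1f KB per request)\n", 13091.5/1151.0, 13091.5/1151.0*512/1024 }'

The nonzero wrqm/s column suggests the guest block layer is merging adjacent 4k writes before they reach the device, which would explain why avgrq-sz floats above 8 instead of matching the fio block size exactly.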
 









On 2014-09-10 08:37:23, "Josh Durgin" wrote:
>On 09/09/2014 07:06 AM, yuelongguang wrote:
>> hi, josh.durgin:
>> i want to know how librbd launch io request.
>> use case:
>> inside vm, i use fio to test rbd-disk's io performance.
>> fio's parameters are bs=4k, direct io, qemu cache=none.
>> in this case, if librbd just sends what it gets from the vm, i.e. no
>> gather/scatter, is the ratio of io inside the vm : io at librbd : io at the osd
>> filestore = 1:1:1?
>
>If the rbd image is not a clone, the io issued from the vm's block
>driver will match the io issued by librbd. With caching disabled
>as you have it, the io from the OSDs will be similar, with some
>small amount extra for OSD bookkeeping.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-11 Thread Francois Deppierraz
Hi Greg,

An attempt to recover pg 3.3ef by copying it from broken osd.6 to
working osd.32 resulted in one more broken osd :(

Here's what was actually done:

root@storage1:~# ceph pg 3.3ef list_missing | head
{ "offset": { "oid": "",
  "key": "",
  "snapid": 0,
  "hash": 0,
  "max": 0,
  "pool": -1,
  "namespace": ""},
  "num_missing": 219,
  "num_unfound": 219,
  "objects": [
[...]
root@storage1:~# ceph pg 3.3ef query
[...]
  "might_have_unfound": [
{ "osd": 6,
  "status": "osd is down"},
{ "osd": 19,
  "status": "already probed"},
{ "osd": 32,
  "status": "already probed"},
{ "osd": 42,
  "status": "already probed"}],
[...]

# Exporting pg 3.3ef from broken osd.6

root@storage2:~# ceph_objectstore_tool --data-path
/var/lib/ceph/osd/ceph-6/ --journal-path
/var/lib/ceph/osd/ssd0/6.journal --pgid 3.3ef --op export --file
~/backup/osd-6.pg-3.3ef.export

# Remove an empty pg 3.3ef which was already present on this OSD

root@storage2:~# service ceph stop osd.32
root@storage2:~# ceph_objectstore_tool --data-path
/var/lib/ceph/osd/ceph-32/ --journal-path
/var/lib/ceph/osd/ssd0/32.journal --pgid 3.3ef --op remove

# Import pg 3.3ef from dump

root@storage2:~# ceph_objectstore_tool --data-path
/var/lib/ceph/osd/ceph-32/ --journal-path
/var/lib/ceph/osd/ssd0/32.journal --op import --file
~/backup/osd-6.pg-3.3ef.export
root@storage2:~# service ceph start osd.32

-1> 2014-09-10 18:53:37.196262 7f13fdd7d780  5 osd.32 pg_epoch:
48366 pg[3.3ef(unlocked)] enter Initial
 0> 2014-09-10 18:53:37.239479 7f13fdd7d780 -1 *** Caught signal
(Aborted) **
 in thread 7f13fdd7d780

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: /usr/bin/ceph-osd() [0x8843da]
 2: (()+0xfcb0) [0x7f13fcfabcb0]
 3: (gsignal()+0x35) [0x7f13fb98a0d5]
 4: (abort()+0x17b) [0x7f13fb98d83b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f13fc2dc69d]
 6: (()+0xb5846) [0x7f13fc2da846]
 7: (()+0xb5873) [0x7f13fc2da873]
 8: (()+0xb596e) [0x7f13fc2da96e]
 9: /usr/bin/ceph-osd() [0x94b34f]
 10:
(pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x12c)
[0x691b6c]
 11: (PGLog::read_log(ObjectStore*, coll_t, hobject_t, pg_info_t const&,
std::map,
std::allocator > >&, PGLog::IndexedLog&, pg_missing_t&,
std::basic_ostringstream,
std::allocator >&, std::set, std::allocator >*)+0x16d4) [0x7d3ef4]
 12: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x2c1) [0x7951b1]
 13: (OSD::load_pgs()+0x18f3) [0x61e143]
 14: (OSD::init()+0x1b9a) [0x62726a]
 15: (main()+0x1e8d) [0x5d2d0d]
 16: (__libc_start_main()+0xed) [0x7f13fb97576d]
 17: /usr/bin/ceph-osd() [0x5d69d9]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Fortunately it was possible to bring back osd.32 into a working state
simply by removing this pg.

root@storage2:~# ceph_objectstore_tool --data-path
/var/lib/ceph/osd/ceph-32/ --journal-path
/var/lib/ceph/osd/ssd0/32.journal --pgid 3.3ef --op remove

Did I miss something from this procedure or does it mean that this pg is
definitely lost?

Thanks!

François

On 09. 09. 14 00:23, Gregory Farnum wrote:
> On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz
>  wrote:
>> Hi Greg,
>>
>> Thanks for your support!
>>
>> On 08. 09. 14 20:20, Gregory Farnum wrote:
>>
>>> The first one is not caused by the same thing as the ticket you
>>> reference (it was fixed well before emperor), so it appears to be some
>>> kind of disk corruption.
>>> The second one is definitely corruption of some kind as it's missing
>>> an OSDMap it thinks it should have. It's possible that you're running
>>> into bugs in emperor that were fixed after we stopped doing regular
>>> support releases of it, but I'm more concerned that you've got disk
>>> corruption in the stores. What kind of crashes did you see previously;
>>> are there any relevant messages in dmesg, etc?
>>
>> Nothing special in dmesg except probably irrelevant XFS warnings:
>>
>> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> 
> Hmm, I'm not sure what the outcome of that could be. Googling for the
> error message returns this as the first result, though:
> http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58429
> Which indicates that it's a real deadlock and capable of messing up
> your OSDs pretty good.
> 
>>
>> All logs from before the disaster are still there, do you have any
>> advise on what would be relevant?
>>
>>> Given these issues, you might be best off identifying exactly which
>>> PGs are missing, carefully copying them to working OSDs (use the osd
>>> store tool), and killing these OSDs. Do lots of backups at each
>>> stage...
>>
>> This sounds scary, I'll keep fingers crossed and will do a bunch of
>> backups. There are 17 pg with missing objects.
>>
>> What do you exactly mean by the osd store tool? Is it the
>> 'ceph_filestore_tool' binary?

Re: [ceph-users] question about librbd io (fio parameters)

2014-09-11 Thread yuelongguang
fio parameters
--fio
[global]
ioengine=libaio
direct=1
rw=randwrite
filename=/dev/vdb
time_based
runtime=300
stonewall
 
[iodepth32]
iodepth=32
bs=4k
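A sketch of sweeping the same job across several queue depths from the command line (equivalent fio flags; the depths listed are only examples), which makes it easier to compare avgqu-sz and cpu usage at each setting:

for d in 1 4 16 32; do
    fio --name=qd$d --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --filename=/dev/vdb --time_based --runtime=300 --iodepth=$d
done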








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs upon Tiering

2014-09-11 Thread Kenneth Waegeman

Hi all,

I am testing the tiering functionality with cephfs. I used a  
replicated cache with an EC data pool, and a replicated metadata pool  
like this:



ceph osd pool create cache 1024 1024
ceph osd pool set cache size 2
ceph osd pool set cache min_size 1
ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd
ceph osd pool create ecdata 128 128 erasure profile11
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
ceph osd pool create metadata 128 128
ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap
ceph fs new ceph_fs metadata cache  <-- wrong ?

I started testing with this, and this worked, I could write to it with  
cephfs and the cache was flushing to the ecdata pool as expected.
But now I notice I made the fs right upon the cache, instead of the  
underlying data pool. I suppose I should have done this:


ceph fs new ceph_fs metadata ecdata

So my question is: Was this wrong and not doing the things I thought  
it did, or was this somehow handled by ceph and didn't it matter I  
specified the cache instead of the data pool?
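One way to see what the cluster actually ended up with (a sketch using the pool names above; the exact field names can vary a little between releases) is to look at the pool entries in the OSD map and at the MDS map, which record the tier/overlay relationship and the data pool the filesystem was created against:

ceph osd dump | grep -E '^pool.*(cache|ecdata)'     # pool lines show tier_of / read_tier / write_tier once an overlay is set
ceph mds dump | grep -E 'data_pools|metadata_pool'  # shows which pools the filesystem is actually using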



Thank you!

Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?

2014-09-11 Thread Andrei Mikhailovsky
Hi, 

I have created the cache tier using the following commands: 

ceph osd pool create cache-pool-ssd 2048 2048
ceph osd pool set cache-pool-ssd crush_ruleset 4
ceph osd pool set cache-pool-ssd size 2
ceph osd pool set cache-pool-ssd min_size 1
ceph osd tier add Primary-ubuntu-1 cache-pool-ssd
ceph osd tier cache-mode cache-pool-ssd writeback
ceph osd tier set-overlay Primary-ubuntu-1 cache-pool-ssd
ceph osd pool set cache-pool-ssd hit_set_type bloom
ceph osd pool set cache-pool-ssd hit_set_count 1
ceph osd pool set cache-pool-ssd hit_set_period 3600
ceph osd pool set cache-pool-ssd target_max_bytes 5000
ceph osd pool set cache-pool-ssd cache_target_full_ratio 0.8
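The later increase to 800GB mentioned below would presumably be done the same way; target_max_bytes takes a value in bytes, so (the exact figure used is an assumption) it would look something like:

ceph osd pool set cache-pool-ssd target_max_bytes $((800*1024*1024*1024))   # 800GB expressed in bytes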


Since the initial install I've increased the target_max_bytes to 800GB. The
rest of the settings are left at their defaults.

Did I miss something that might explain the behaviour I am experiencing?

Cheers 

Andrei 


- Original Message -

From: "Xiaoxi Chen"  
To: "Andrei Mikhailovsky" , "ceph-users" 
 
Sent: Thursday, 11 September, 2014 2:00:31 AM 
Subject: RE: Cache Pool writing too much on ssds, poor performance? 



Could you show your cache tiering configuration? Especially these three
parameters:

ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
ceph osd pool set hot-storage cache_target_full_ratio 0.8
ceph osd pool set {cachepool} target_max_bytes {#bytes}




From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrei 
Mikhailovsky 
Sent: Wednesday, September 10, 2014 8:51 PM 
To: ceph-users 
Subject: [ceph-users] Cache Pool writing too much on ssds, poor performance? 



Hello guys, 

I am experimenting with the cache pool and running some tests to see how adding the
cache pool improves the overall performance of our small cluster.

While doing the testing I've noticed that the cache pool seems to be writing
too much on the cache pool ssds. Not sure what the issue is here; perhaps someone
could help me understand what is going on.

My test cluster is: 
2 x OSD servers (Each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds 
journals, 2 ssds for cache pool, 40gbit/s infiniband network capable of 
25gbit/s over ipoib). Cache pool is set to 500GB with replica of 2. 
4 x host servers (128GB ram, 24 core, 40gbit/s infiniband network capable of 
12gbit/s over ipoib) 

So, my test is: 
Simple tests using the following command: "dd if=/dev/vda of=/dev/null bs=4M 
count=2000 iflag=direct". I am concurrently starting this command on 10 virtual 
machines which are running on 4 host servers. The aim is to monitor the use of 
cache pool when reading the same data over and over again. 


Running the above command for the first time does what I was expecting. The 
osds are doing a lot of reads, the cache pool does a lot of writes (around 
250-300MB/s per ssd disk) and no reads. The dd results for the guest vms are 
poor. The results of the "ceph -w" shows consistent performance across the 
time. 

Running the above for the second and consequent times produces IO patterns 
which I was not expecting at all. The hdd osds are not doing much (this part I 
expected), the cache pool still does a lot of writes and very little reads! The 
dd results have improved just a little, but not much. The results of the "ceph 
-w" shows performance breaks over time. For instance, I have a peak of 
throughput in the first couple of seconds (data is probably coming from the osd 
server's ram at high rate). After the peak throughput has finished, the ceph 
reads are done in the following way: 2-3 seconds of activity followed by 2
seconds of inactivity, and it keeps doing that throughout the length of the
test. So, to put the numbers in perspective, when running tests over and over
again I would get around 2000 - 3000MB/s for the first two seconds, followed by
0MB/s for the next two seconds, followed by around 150-250MB/s over 2-3
seconds, followed by 0MB/s for 2 seconds, followed by 150-250MB/s over 2-3
seconds, followed by 0MB/s over 2 seconds, and the pattern repeats until the
test is done. 


I kept running the dd command about 15-20 times and observed the same
behaviour. The cache pool does mainly writes (around 200MB/s per ssd) when
guest vms are reading the same data over and over again. There is very little 
read IO (around 20-40MB/s). Why am I not getting high read IO? I have expected 
the 80GB of data that is being read from the vms over and over again to be 
firmly recognised as the hot data and kept in the cache pool and read from it 
when guest vms request the data. Instead, I mainly get writes on the cache pool 
ssds and I am not really sure where these writes are coming from as my hdd osds 
are being pretty idle. 

From the overall tests so far, introducing the cache pool has drastically
slowed down my cluster (by as much as 50-60%).

Thanks for any help 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Rebalancing slow I/O.

2014-09-11 Thread Irek Fasikhov
Hi,All.

DELL R720X8,96 OSDs, Network 2x10Gbit LACP.

When one of the nodes crashes, I get very slow I/O operations on virtual
machines.
A cluster map by default.
[ceph@ceph08 ~]$ ceph osd tree
# id    weight  type name       up/down reweight
-1  262.1   root defaults
-2  32.76   host ceph01
0   2.73osd.0   up  1
...
11  2.73osd.11  up  1
-3  32.76   host ceph02
13  2.73osd.13  up  1
..
12  2.73osd.12  up  1
-4  32.76   host ceph03
24  2.73osd.24  up  1

35  2.73osd.35  up  1
-5  32.76   host ceph04
37  2.73osd.37  up  1
.
47  2.73osd.47  up  1
-6  32.76   host ceph05
48  2.73osd.48  up  1
...
59  2.73osd.59  up  1
-7  32.76   host ceph06
60  2.73osd.60  down0
...
71  2.73osd.71  down0
-8  32.76   host ceph07
72  2.73osd.72  up  1

83  2.73osd.83  up  1
-9  32.76   host ceph08
84  2.73osd.84  up  1

95  2.73osd.95  up  1


If I change the cluster map to the following:
root---|
  |
  |-rack1
  ||
  |host ceph01
  |host ceph02
  |host ceph03
  |host ceph04
  |
  |---rack2
   |
  host ceph05
  host ceph06
  host ceph07
  host ceph08
What will the cluster's behaviour be during failover of one node? And how much will it affect
the performance?
Thank you

-- 
Best regards, Irek Fasikhov
Mob.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] error while installing ceph in cluster node

2014-09-11 Thread Subhadip Bagui
Hi,

Please let me know what can be the issue.

Regards,
Subhadip
---

On Thu, Sep 11, 2014 at 9:54 AM, Subhadip Bagui  wrote:

> Hi,
>
> I'm getting the below error while installing ceph on node using
> ceph-deploy. I'm executing the command in admin node as
>
> [root@ceph-admin ~]$ ceph-deploy install ceph-mds
>
> [ceph-mds][DEBUG ] Loaded plugins: fastestmirror, security
> [ceph-mds][WARNIN] You need to be root to perform this command.
> [ceph-mds][ERROR ] RuntimeError: command returned non-zero exit status: 1
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y
> install wget
>
> I have changed the Defaults requiretty setting to Defaults:ceph
> !requiretty in the /etc/sudoers file and also made ceph a sudo user with the same
> rights as root on node ceph-mds. I added the root privilege on node ceph-mds using
> the command: echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers
> followed by sudo chmod 0440 /etc/sudoers, as mentioned in the doc.
>
> All servers are on centOS 6.5
>
> Please let me know what can be the issue here?
>
>
> Regards,
> Subhadip
> ---
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rebalancing slow I/O.

2014-09-11 Thread Andrei Mikhailovsky
Irek, 

have you changed the ceph.conf file to adjust the recovery priority?

Options like these might help with prioritising repair/rebuild IO against the
client IO:

osd_recovery_max_chunk = 8388608 
osd_recovery_op_priority = 2 
osd_max_backfills = 1 
osd_recovery_max_active = 1 
osd_recovery_threads = 1 
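Most of these can also be applied to a running cluster without restarting the OSDs (a sketch, assuming every OSD should get the same values and that your release accepts them via injectargs):

ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 2'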


Andrei 
- Original Message -

From: "Irek Fasikhov"  
To: ceph-users@lists.ceph.com 
Sent: Thursday, 11 September, 2014 1:07:06 PM 
Subject: [ceph-users] Rebalancing slow I/O. 





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (no subject)

2014-09-11 Thread Alfredo Deza
We discourage users from using `root` to call ceph-deploy or to call
it with `sudo` for this reason.

We have a warning in the docs about it if you are getting started in
the Ceph Node Setup section:
http://ceph.com/docs/v0.80.5/start/quick-start-preflight/#ceph-deploy-setup

The reason for this is that if you configure ssh to login to the
remote server as a non-root user (say user "ceph") there is no way for
ceph-deploy to know that you need to call sudo
on the remote server because it detected you were root.

ceph-deploy does this detection to prevent calling sudo if you are
root on the remote server.

So, to fix this situation, where you are executing as root but login
into the remote server as a non-root user you can use either of these
two options:

* don't execute ceph-deploy as root
* don't configure ssh to login as a non-root user
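If you go the non-root route, the remote user can also be given to ceph-deploy explicitly on the command line (a sketch; "ceph" is just the example user from this thread):

ceph-deploy --username ceph install ceph-mds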

On Thu, Sep 11, 2014 at 12:16 AM, Subhadip Bagui  wrote:
> Hi,
>
> I'm getting the below error while installing ceph on node using ceph-deploy.
> I'm executing the command in admin node as
>
> [root@ceph-admin ~]$ ceph-deploy install ceph-mds
>
> [ceph-mds][DEBUG ] Loaded plugins: fastestmirror, security
> [ceph-mds][WARNIN] You need to be root to perform this command.
> [ceph-mds][ERROR ] RuntimeError: command returned non-zero exit status: 1
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y
> install wget
>
> I have changed the Defaults requiretty setting to Defaults:ceph !requiretty
> in /etc/sudoers file and also put ceph as sudo user same as root in node
> ceph-mds. added root privilege on node ceph-mds using command--- echo "ceph
> ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers sudo chmod 0440
> /etc/sudoers as mentioned in the doc
>
> All servers are on centOS 6.5
>
> Please let me know what can be the issue here?
>
>
> Regards,
> Subhadip
> ---
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Striping with cloned images

2014-09-11 Thread Gerhard Wolkerstorfer
Hi,

I am running a Ceph cluster that contains the following RBD image:

root@ceph0:~# rbd info -p cephstorage debian_6_0_9_template_system
rbd image 'debian_6_0_9_template_system':
size 30720 MB in 7680 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1907c2ae8944a
format: 2
features: layering, striping
stripe unit: 4096 kB
stripe count: 1

If I make a flattened clone of the image (create a snapshot - protect the
snapshot - clone - flatten the clone), the output of rbd info is:

root@ceph0:~# rbd info -p cephstorage debian_6_0_9_system
rbd image 'debian_6_0_9_system':
size 30720 MB in 7680 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.199bc2eb141f2
format: 2
features: layering

Why is the striping feature missing in the clone? Is there a way to enable it 
on the cloned image? Does it even matter?

Best
Gerhard



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?

2014-09-11 Thread Mark Nelson
Something that is very important to keep in mind with the way that the 
cache tier implementation currently works in Ceph is that cache misses 
are very expensive.  It's really important that your workload have a 
really big hot/cold data skew otherwise it's not going to work well at 
all.  In your case, you are doing sequential reads which is terrible for 
this because for each pass you are never re-reading the same blocks, and 
by the time you get to the end of the test and restart it, the first 
blocks (apparently) have already been flushed.  If you increased the 
size of the cache tier, you might be able to fit the whole thing in 
cache which should help dramatically, but that's not easy to guarantee 
outside of benchmarks.


I'm guessing you are using firefly right?  To improve this behaviour, 
Sage implemented a new policy in the recent development releases not to 
promote reads right away.  Instead, we wait to promote until there are 
several reads of the same object within a certain time period.  That 
should dramatically help in this case.  You really don't want big 
sequential reads being promoted into cache since cache promotion is 
expensive and the disks are really good at doing that kind of thing anyway.


On the flip side, 4MB read misses are bad, but the situation is even 
worse with 4K misses. Imagine for instance that you are going to do a 4K 
read from a default 4MB RBD block and that object is not in cache.  In 
the implementation we have in firefly, the whole 4MB object will be 
promoted to the cache which will in most cases require a transfer of 
that object over the network to the primary OSD for the associated PG in 
the cache pool.  Now depending on the replication policy, that primary 
cache pool OSD will fan out and write (by default) 2 extra copies of the 
data to the other OSDs in the PG, so 3 total.  Now assuming your cache 
tier is on SSDs with co-located journals, each one of those writes will 
actually be 2 writes, one to the journal, and one to the data store.


To recap: *Any* read miss regardless if it's 4KB or 4MB means at least 1 
4MB object promotion, times 3 replicas (ie 12MB over the network) times 
2 for the journal writes. So 24MB of data written to the cache tier 
disks, no matter if it's 4KB or 4MB.  Imagine you have 200 IOPS worth of 
4KB read cache misses.  That's roughly 4.8GB/s of writes into the cache 
tier.  If you are seldomly re-reading the same blocks, performance will 
be absolutely terrible.  On the other hand, if you have lots of small 
random reads from the same set of 4MB objects, the cache tier really can 
help.  How much it helps vs just doing the reads from page cache is 
debatable though.  There's some band between page cache and disk where 
the cache tier fits in, but getting everything right is going to be tricky.
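Spelling out the arithmetic behind that 4.8GB/s figure (plain shell arithmetic, nothing Ceph-specific):

# 200 read misses/s, each promoting one 4MB object, times 3 replicas, times 2 writes each (journal + data store)
echo "$((200 * 4 * 3 * 2)) MB/s of writes into the cache tier"   # prints 4800 MB/s, i.e. ~4.8GB/s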


The optimal situation imho is that the cache tier only promote objects 
that have a lot of small random reads hitting them, and be very 
conservative about promotions, especially for new writes.  I don't know 
whether or not cache promotion might pollute page cache in strange ways, 
but that's something we also may need to be careful about.


For more info, see the following thread:

http://www.spinics.net/lists/ceph-devel/msg20189.html

Mark

On 09/10/2014 07:51 AM, Andrei Mikhailovsky wrote:


Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?

2014-09-11 Thread Andrei Mikhailovsky
Mark, 

Thanks for a very detailed email. I really appreciate your help on this. I now
have a bit more understanding of how it works and understand why I am getting
so many writes on the cache ssds.

I am, however, struggling to understand why the cache pool is not keeping the data
and is flushing it instead. The pool is about 7x as large as the current
benchmark set (80gb data set vs 500GB pool size), so the benchmark data should
fit nicely many times over. I could understand a small percentage of the
data being cache misses, but from what it looks like it is missing a considerable
amount.

Is there a way to check the stats of the cache pool, including hit/miss 
information and other data? 
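A sketch of where that information lives (assuming an admin socket on one of the OSDs holding cache-pool PGs; the counter names here are from memory and may differ slightly between releases):

# per-OSD cache tiering counters: promotions, flushes, evictions, dirty/clean object counts
ceph daemon osd.0 perf dump | grep -E 'tier_(promote|flush|evict|dirty|clean)'

# pool-level usage, to see how full the cache pool actually is
ceph df detail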

Yes, I am using firefly 0.80.5. 

Thanks 

Andrei 

- Original Message -

From: "Mark Nelson"  
To: ceph-users@lists.ceph.com 
Sent: Thursday, 11 September, 2014 3:02:40 PM 
Subject: Re: [ceph-users] Cache Pool writing too much on ssds, poor 
performance? 


Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?

2014-09-11 Thread Mark Nelson

I'd take a look at:

http://ceph.com/docs/master/rados/operations/pools/

and see if any of the options that govern cache flush behaviour may be 
affecting things.  Specifically:


cache_target_dirty_ratio
cache_target_full_ratio
target_max_bytes
target_max_objects
cache_min_flush_age
cache_min_evict_age
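For the pool in this thread, pinning these would look something like the following (a sketch only; the values are illustrative, not recommendations):

ceph osd pool set cache-pool-ssd cache_target_dirty_ratio 0.4
ceph osd pool set cache-pool-ssd cache_target_full_ratio 0.8
ceph osd pool set cache-pool-ssd cache_min_flush_age 600
ceph osd pool set cache-pool-ssd cache_min_evict_age 1800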

If it's not any of those, it might be worth flushing cache and then 
seeing how quickly writes are happening vs reads when the cache is empty 
and where you end up at the end of 1 iteration of the test.  Perhaps 
there is something unusual going on.


Mark

On 09/11/2014 09:26 AM, Andrei Mikhailovsky wrote:


Re: [ceph-users] osd cpu usage is bigger than 100%

2014-09-11 Thread Gregory Farnum
Presumably it's going faster when you have a deeper iodepth? So the reason
it's using more CPU is because it's doing more work. That's all there is to
it. (And the OSD uses a lot more CPU than some storage systems do, because
it does a lot more work than them.)
-Greg

On Thursday, September 11, 2014, yuelongguang  wrote:


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why one osd-op from client can get two osd-op-reply?

2014-09-11 Thread Gregory Farnum
It's the recovery and backfill code. There's not one place; it's what most
of the OSD code is for.

On Thursday, September 11, 2014, yuelongguang  wrote:

> as for the second question, could you tell me where the code is.
> how ceph makes size/min_szie copies?
>
> thanks
>
>
>
>
>
>
>
> At 2014-09-11 12:19:18, "Gregory Farnum"  > wrote:
> >On Wed, Sep 10, 2014 at 8:29 PM, yuelongguang  >> wrote:
> >>
> >>
> >>
> >>
> >> as for ack and ondisk, ceph has size and min_size to decide there are  how
> >> many replications.
> >> if client receive ack or ondisk, which means there are at least min_size
> >> osds  have  done the ops?
> >>
> >> i am reading the cource code, could you help me with the two questions.
> >>
> >> 1.
> >>  on osd, where is the code that reply ops  separately  according to ack or
> >> ondisk.
> >>  i check the code, but i thought they always are replied together.
> >
> >It depends on what journaling mode you're in, but generally they're
> >triggered separately (unless it goes on disk first, in which case it
> >will skip the ack — this is the mode it uses for non-btrfs
> >filesystems). The places where it actually replies are pretty clear
> >about doing one or the other, though...
> >
> >>
> >> 2.
> >>  now i just know how client write ops to primary osd, inside osd cluster,
> >> how it promises min_size copy are reached.
> >> i mean  when primary osd receives ops , how it spreads ops to others, and
> >> how it processes other's reply.
> >
> >That's not how it works. The primary for a PG will not go "active"
> >with it until it has at least min_size copies that it knows about.
> >Once the OSD is doing any processing of the PG, it requires all
> >participating members to respond before it sends any messages back to
> >the client.
> >-Greg
> >Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
> >>
> >>
> >> greg, thanks very much
> >>
> >>
> >>
> >>
> >>
> >> 在 2014-09-11 01:36:39,"Gregory Farnum"  >> > 写道:
> >>
> >> The important bit there is actually near the end of the message output 
> >> line,
> >> where the first says "ack" and the second says "ondisk".
> >>
> >> I assume you're using btrfs; the ack is returned after the write is applied
> >> in-memory and readable by clients. The ondisk (commit) message is returned
> >> after it's durable to the journal or the backing filesystem.
> >> -Greg
> >>
> >> On Wednesday, September 10, 2014, yuelongguang  >> > wrote:
> >>>
> >>> hi,all
> >>> i was recently debugging ceph rbd, and the log shows that one write to an
> >>> osd can get two replies.
> >>> the difference between them is the seq.
> >>> why?
> >>>
> >>> thanks
> >>> ---log-
> >>> reader got message 6 0x7f58900010a0 osd_op_reply(15
> >>> rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304
> >>> write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6
> >>> 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue
> >>> 0x7f58900010a0 prio 127
> >>> 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> >>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> >>> c=0xfae940).reader reading tag...
> >>> 2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> >>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> >>> c=0xfae940).reader got MSG
> >>> 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> >>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> >>> c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0
> >>> 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
> >>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> >>> c=0xfae940).reader wants 247 from dispatch throttler 247/104857600
> >>> 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> >>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> >>> c=0xfae940).reader got front 247
> >>> 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
> >>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> >>> c=0xfae940).aborted = 0
> >>> 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> >>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> >>> c=0xfae940).reader got 247 + 0 + 0 byte message
> >>> 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq #
> >>> = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0
> >>> 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
> >>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> >>> c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15
> >>> rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304
> >>> write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6
> >>>
> >>>
> >>
> >>
> >> --
> >> Software Engineer #42 @ http://inktank.com | http

Re: [ceph-users] Is ceph osd reweight always safe to use?

2014-09-11 Thread JR
Greetings

Just a follow up on the resolution of this issue.

Restarting ceph-osd on one of the nodes solved the problem of the
stuck unclean pgs.
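For reference, "restarting ceph-osd on one of the nodes" on an Ubuntu system of
this era looks roughly like the sketch below; the osd id is hypothetical and the
exact service command depends on how the daemons were deployed:

    # upstart-style
    sudo restart ceph-osd id=2
    # or sysvinit-style
    sudo service ceph restart osd.2

    # then watch peering/recovery finish
    ceph -s
    ceph pg dump_stuck unclean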

Thanks,
JR

On 9/9/2014 2:24 AM, Christian Balzer wrote:
> 
> Hello,
> 
> On Tue, 09 Sep 2014 01:25:17 -0400 JR wrote:
> 
>> Greetings
>> 
>> After running for a couple of hours, my attempt to re-balance a 
>> nearly full disk has stopped with a stuck unclean error:
>> 
> Which is exactly what I warned you about below and what you should 
> have also taken away from fully reading the "Uneven OSD usage" 
> thread.
> 
> This also should hammer my previous point about your current 
> cluster size/utilization home. Even with a better (don't expect 
> perfect) data distribution, loss of one node might well find you 
> with a full OSD again.
> 
>> root@osd45:~# ceph -s
>>   cluster c8122868-27af-11e4-b570-52540004010f
>>    health HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean;
>>           recovery 13086/1158268 degraded (1.130%)
>>    monmap e1: 3 mons at
>>           {osd42=10.7.7.142:6789/0,osd43=10.7.7.143:6789/0,osd45=10.7.7.145:6789/0},
>>           election epoch 80, quorum 0,1,2 osd42,osd43,osd45
>>    osdmap e723: 8 osds: 8 up, 8 in
>>     pgmap v543113: 640 pgs: 634 active+clean, 6 active+remapped+backfilling;
>>           GB data, 2239 GB used, 1295 GB / 3535 GB avail; 8268B/s wr, 0op/s;
>>           13086/1158268 degraded (1.130%)
>>    mdsmap e63: 1/1/1 up {0=osd42=up:active}, 3 up:standby
>> 
> From what I've read in the past the way forward here is to
> increase the full ratio setting so it can finish the recovery. Or
> add more OSDs, at least temporarily. See: 
> http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
>
>
> 
> Read that and apply that knowledge to your cluster, I personally
> wouldn't deploy it in this state.
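For what it's worth, a hedged sketch of the "increase the full ratio" suggestion
above, using the monitor commands of this era (the values are only illustrative;
read the storage-capacity doc linked above before touching these):

    ceph pg set_nearfull_ratio 0.90
    ceph pg set_full_ratio 0.97
    # lower them back to the defaults once recovery has finished and space is freed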
> 
> Once the recovery is finished I'd proceed cautiously, see below.
> 
>> 
>> The sequence of events today that led to this were:
>> 
>> # starting state: pg_num/pgp_num == 64
>> ceph osd pool set rbd pg_num 128
>> ceph osd pool set rbd pgp_num 128
>> # there was a warning thrown up (which I've lost) and which left pgp_num == 64
>> # nothing happens since pgp_num was inadvertently not raised
>> ceph osd reweight-by-utilization
>> # data moves from one osd on a host to another osd on the same host
>> ceph osd reweight 7 1
>> # data moves back to roughly what it had been
> Never mind the the lack of PGs to play with, manually lowering the 
> weight of the fullest OSD (in small steps) at this time might have 
> given you at least a more level playing field.
> 
>> ceph osd pool set volumes pg_num 192
>> ceph osd pool set volumes pgp_num 192
>> # data moves successfully
> This would have been the time to check what actually happened and 
> if things improved or not (just adding PGs/PGPs might not be 
> enough) and again to manually reweight overly full OSDs.
> 
>> ceph osd pool set rbd pg_num 192
>> ceph osd pool set rbd pgp_num 192
>> # data stuck
>> 
> Baby steps. As in, applying the rise to 128 PGPs first. But I
> guess you would have run into the full OSD either way w/o
> reweighting things between steps.
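A sketch of the "baby steps" approach, reusing the commands already shown in this
thread: raise pgp_num to match the existing pg_num first, wait until all PGs are
active+clean, and only then raise both again in a small increment:

    ceph osd pool set rbd pgp_num 128
    ceph -s                              # wait for the remapping to settle
    ceph osd pool set rbd pg_num 192
    ceph osd pool set rbd pgp_num 192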
> 
>> googling (nowadays known as research) reveals that these might be
>> helpful:
>> 
>> - ceph osd crush tunables optimal
> Yes, this might help. Not sure if that works with dumpling, but as 
> I already mentioned dumpling doesn't support "chooseleaf_vary_r". 
> And hashpspool. And while the data movement caused by this probably 
> will result in a better balanced cluster (again, with too little 
> PGs it will still do poorly), in the process of getting there it 
> might still run into a full OSD scenario.
> 
>> - setting crush weights to 1
>> 
> Dunno about that one, my crush weights were 1 when I deployed 
> things manually for the first time, the size of the OSD for the
> 2nd manual deployment and ceph-deploy also uses the OSD size in
> TB.
> 
> Christian
> 
>> I resist doing anything for now in the hopes that someone has 
>> something coherent to say (Christian? ;-)
>> 
>> Thanks JR
>> 
>> 
>> On 9/8/2014 10:37 PM, JR wrote:
>>> Hi Christian,
>>> 
>>> Ha ...
>>> 
>>> root@osd45:~# ceph osd pool get rbd pg_num pg_num: 128 
>>> root@osd45:~# ceph osd pool get rbd pgp_num pgp_num: 64
>>> 
>>> That's the explanation!  I did run the command but it spit out 
>>> some (what I thought was a harmless) warning; should have 
>>> checked more carefully.
>>> 
>>> I now have the expected data movement.
>>> 
>>> Thanks alot! JR
>>> 
>>> On 9/8/2014 10:04 PM, Christian Balzer wrote:
 
 Hello,
 
 On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
 
> Hi Christian, all,
> 
> Having researched this a bit more, it seemed that just 
> doing
> 
> ceph osd pool set rbd pg_num 128 ceph osd pool set rbd 
> pgp_num 128
> 
> might be the answer.  Alas, it was not. After running the 
> above the cluster just sat there.
> 
 Really now? No data movement, no health warnings during that 
 in the logs, no other error in the l

Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-11 Thread Cedric Lemarchand

On 11/09/2014 08:20, Alexandre DERUMIER wrote:
> Hi Sebastien,
>
> here my first results with crucial m550 (I'll send result with intel s3500 
> later):
>
> - 3 nodes
> - dell r620 without expander backplane
> - sas controller : lsi LSI 9207 (no hardware raid or cache)
> - 2 x E5-2603v2 1.8GHz (4cores)
> - 32GB ram
> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication.
>
> -os : debian wheezy, with kernel 3.10
>
> os + ceph mon : 2x intel s3500 100gb  linux soft raid
> osd : crucial m550 (1TB).
>
>
> 3mon in the ceph cluster,
> and 1 osd (journal and datas on same disk)
>
>
> ceph.conf 
> -
>   debug_lockdep = 0/0
>   debug_context = 0/0
>   debug_crush = 0/0
>   debug_buffer = 0/0
>   debug_timer = 0/0
>   debug_filer = 0/0
>   debug_objecter = 0/0
>   debug_rados = 0/0
>   debug_rbd = 0/0
>   debug_journaler = 0/0
>   debug_objectcatcher = 0/0
>   debug_client = 0/0
>   debug_osd = 0/0
>   debug_optracker = 0/0
>   debug_objclass = 0/0
>   debug_filestore = 0/0
>   debug_journal = 0/0
>   debug_ms = 0/0
>   debug_monc = 0/0
>   debug_tp = 0/0
>   debug_auth = 0/0
>   debug_finisher = 0/0
>   debug_heartbeatmap = 0/0
>   debug_perfcounter = 0/0
>   debug_asok = 0/0
>   debug_throttle = 0/0
>   debug_mon = 0/0
>   debug_paxos = 0/0
>   debug_rgw = 0/0
>   osd_op_threads = 5
>   filestore_op_threads = 4
>
>  ms_nocrc = true
>  cephx sign messages = false
>  cephx require signatures = false
>
>  ms_dispatch_throttle_bytes = 0
>
>  #0.85
>  throttler_perf_counter = false
>  filestore_fd_cache_size = 64
>  filestore_fd_cache_shards = 32
>  osd_op_num_threads_per_shard = 1
>  osd_op_num_shards = 25
>  osd_enable_op_tracker = true
>
>
>
> Fio disk 4K benchmark
> --
> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k 
> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> bw=271755KB/s, iops=67938 
>
> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k 
> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> bw=228293KB/s, iops=57073
>
>
>
> fio osd benchmark (through librbd)
> --
> [global]
> ioengine=rbd
> clientname=admin
> pool=test
> rbdname=test
> invalidate=0# mandatory
> rw=randwrite
> rw=randread
> bs=4k
> direct=1
> numjobs=4
> group_reporting=1
>
> [rbd_iodepth32]
> iodepth=32
>
>
>
> FIREFLY RESULTS
> 
> fio randwrite : bw=5009.6KB/s, iops=1252
>
> fio randread: bw=37820KB/s, iops=9455
>
>
>
> O.85 RESULTS
> 
>
> fio randwrite : bw=11658KB/s, iops=2914
>
> fio randread : bw=38642KB/s, iops=9660
>
>
>
> 0.85 + osd_enable_op_tracker=false
> ---
> fio randwrite : bw=11630KB/s, iops=2907
> fio randread : bw=80606KB/s, iops=20151,   (cpu 100% - GREAT !)
>
>
>
> So, for read, seem that osd_enable_op_tracker is the bottleneck.
>
>
> Now for write, I really don't understand why it's so low.
>
>
> I have done some iostat:
>
>
> FIO directly on /dev/sdb
> bw=228293KB/s, iops=57073
>
> Device:     rrqm/s   wrqm/s    r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb           0,00     0,00   0,00 63613,00     0,00 254452,00     8,00    31,24    0,49    0,00    0,49   0,02 100,00
>
>
> FIO directly on osd through librbd
> bw=11658KB/s, iops=2914
>
> Device:     rrqm/s   wrqm/s    r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb           0,00   355,00   0,00  5225,00     0,00  29678,00    11,36    57,63   11,03    0,00   11,03   0,19  99,70
>
>
> (I don't understand what exactly %util means; it's 100% in both cases, even 
> though it's 10x slower with ceph)
It would be interesting if you could catch the size of writes on SSD
during the bench through librbd (I know nmon can do that)
>
> It could be a dsync problem, result seem pretty poor
>
> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
>
>
> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
> ^C17228+0 records in
> 17228+0 records out
> 70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
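(A hedged companion to the dd test above: the usual way to quantify O_DSYNC
4k-write behaviour with fio, run against a disposable device only; /dev/sdb is
taken from the thread and the job name is arbitrary:

    fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-dsync-test

A drive that collapses to a few hundred iops here will tend to behave the same
way under a filestore journal.)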
>
>
>
> I'll do tests with intel s3500 tomorrow to compare
>
> - Original Message - 
>
> From: "Sebastien Han"  
> To: "Warren Wang"  
> Cc: ceph-users@lists.ceph.com 
> Sent: Monday 8 September 2014 22:58:25 
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
> IOPS 
>
> They definitely are Warren! 
>
> Thanks for bringing thi

[ceph-users] radosgw user creation in secondary site error

2014-09-11 Thread Santhosh Fernandes
Hi All,

When I try to create a user on the federated gateway secondary site, I get the
following error.

 radosgw-admin user create --uid="eu-east" --display-name="Region-EU
Zone-East" --name client.radosgw.eu-east-1 --system
2014-09-11 22:34:50.234269 7f3da41327c0 -1 ERROR: region map does not
specify master region
couldn't init storage provider

The same commands work fine on the master site.

Can anyone help me resolve this issue?

Regards,
Santhosh
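For anyone hitting the same error: a hedged sketch of the usual cause in the
federated-gateway setup of this era is that the region map on the secondary was
never set/updated with a region marked "is_master": true. Roughly (the file name
and region name are placeholders from the federated-config doc):

    radosgw-admin region set --infile us.json --name client.radosgw.eu-east-1
    radosgw-admin region default --rgw-region=us --name client.radosgw.eu-east-1
    radosgw-admin regionmap update --name client.radosgw.eu-east-1

After that, user creation on the secondary should at least get past this error.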
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-11 Thread Somnath Roy
Sam/Sage,
I have addressed all of your comments and pushed the changes to the same pull 
request.

https://github.com/ceph/ceph/pull/2440

Thanks & Regards
Somnath

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Wednesday, September 10, 2014 8:33 PM
To: Somnath Roy
Cc: Samuel Just; ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: RE: OpTracker optimization

I had two substantive comments on the first patch and then some trivial 
whitespace nits.  Otherwise looks good!

thanks-
sage

On Thu, 11 Sep 2014, Somnath Roy wrote:

> Sam/Sage,
> I have incorporated all of your comments. Please have a look at the same pull 
> request.
> 
> https://github.com/ceph/ceph/pull/2440
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Wednesday, September 10, 2014 3:25 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OpTracker optimization
> 
> Oh, I changed my mind, your approach is fine.  I was unclear.
> Currently, I just need you to address the other comments.
> -Sam
> 
> On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy  wrote:
> > As I understand, you want me to implement the following.
> >
> > 1.  Keep this implementation one sharded optracker for the ios going 
> > through ms_dispatch path.
> >
> > 2. Additionally, for ios going through ms_fast_dispatch, you want me 
> > to implement optracker (without internal shard) per opwq shard
> >
> > Am I right ?
> >
> > Thanks & Regards
> > Somnath
> >
> > -Original Message-
> > From: Samuel Just [mailto:sam.j...@inktank.com]
> > Sent: Wednesday, September 10, 2014 3:08 PM
> > To: Somnath Roy
> > Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> > ceph-users@lists.ceph.com
> > Subject: Re: OpTracker optimization
> >
> > I don't quite understand.
> > -Sam
> >
> > On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  
> > wrote:
> >> Thanks Sam.
> >> So, you want me to go with optracker/shadedopWq , right ?
> >>
> >> Regards
> >> Somnath
> >>
> >> -Original Message-
> >> From: Samuel Just [mailto:sam.j...@inktank.com]
> >> Sent: Wednesday, September 10, 2014 2:36 PM
> >> To: Somnath Roy
> >> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> >> ceph-users@lists.ceph.com
> >> Subject: Re: OpTracker optimization
> >>
> >> Responded with cosmetic nonsense.  Once you've got that and the other 
> >> comments addressed, I can put it in wip-sam-testing.
> >> -Sam
> >>
> >> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  
> >> wrote:
> >>> Thanks Sam..I responded back :-)
> >>>
> >>> -Original Message-
> >>> From: ceph-devel-ow...@vger.kernel.org 
> >>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
> >>> Sent: Wednesday, September 10, 2014 11:17 AM
> >>> To: Somnath Roy
> >>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> >>> ceph-users@lists.ceph.com
> >>> Subject: Re: OpTracker optimization
> >>>
> >>> Added a comment about the approach.
> >>> -Sam
> >>>
> >>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  
> >>> wrote:
>  Hi Sam/Sage,
> 
>  As we discussed earlier, enabling the present OpTracker code 
>  degrades performance severely. For example, in my setup a single 
>  OSD node with
>  10 clients is reaching ~103K read iops with io served from memory 
>  while optracking is disabled but enabling optracker it is reduced to 
>  ~39K iops.
>  Probably, running OSD without enabling OpTracker is not an option 
>  for many of Ceph users.
> 
>  Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist
>  ops_in_flight) and removing some other bottlenecks I am able to 
>  match the performance of OpTracking enabled OSD with OpTracking 
>  disabled, but with the expense of ~1 extra cpu core.
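(For context, the tracker being discussed is what feeds the OSD admin socket; a
quick way to see it in action, with a hypothetical osd id:

    ceph daemon osd.0 dump_ops_in_flight     # ops the OpTracker is tracking
    ceph daemon osd.0 dump_historic_ops      # recently completed slow ops

and the ceph.conf switch the thread refers to is "osd enable op tracker = false",
which trades this visibility away for the extra iops.)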
> 
>  In this process I have also fixed the following tracker.
> 
> 
> 
>  http://tracker.ceph.com/issues/9384
> 
> 
> 
>  and probably http://tracker.ceph.com/issues/8885 too.
> 
> 
> 
>  I have created following pull request for the same. Please review it.
> 
> 
> 
>  https://github.com/ceph/ceph/pull/2440
> 
> 
> 
>  Thanks & Regards
> 
>  Somnath
> 
> 
> 
> 
>  
> 
>  PLEASE NOTE: The information contained in this electronic mail 
>  message is intended only for the use of the designated 
>  recipient(s) named above. If the reader of this message is not 
>  the intended recipient, you are hereby notified that you have 
>  received this message in error and that any review, 
>  dissemination, distribution, or copying of this message is 
>  strictly prohibited. If you have received this communication in 
>  error, please notify the sender by telephone or e-mail (as shown 
>  abo

Re: [ceph-users] Cephfs upon Tiering

2014-09-11 Thread Gregory Farnum
On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman
 wrote:
> Hi all,
>
> I am testing the tiering functionality with cephfs. I used a replicated
> cache with an EC data pool, and a replicated metadata pool like this:
>
>
> ceph osd pool create cache 1024 1024
> ceph osd pool set cache size 2
> ceph osd pool set cache min_size 1
> ceph osd erasure-code-profile set profile11 k=8 m=3
> ruleset-failure-domain=osd
> ceph osd pool create ecdata 128 128 erasure profile11
> ceph osd tier add ecdata cache
> ceph osd tier cache-mode cache writeback
> ceph osd tier set-overlay ecdata cache
> ceph osd pool set cache hit_set_type bloom
> ceph osd pool set cache hit_set_count 1
> ceph osd pool set cache hit_set_period 3600
> ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
> ceph osd pool create metadata 128 128
> ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap
> ceph fs new ceph_fs metadata cache  <-- wrong ?
>
> I started testing with this, and this worked, I could write to it with
> cephfs and the cache was flushing to the ecdata pool as expected.
> But now I notice I made the fs right upon the cache, instead of the
> underlying data pool. I suppose I should have done this:
>
> ceph fs new ceph_fs metadata ecdata
>
> So my question is: Was this wrong and not doing the things I thought it did,
> or was this somehow handled by ceph and didn't it matter I specified the
> cache instead of the data pool?

Well, it's sort of doing what you want it to. You've told the
filesystem to use the "cache" pool as the location for all of its
data. But RADOS is pushing everything in the "cache" pool down to the
"ecdata" pool.
So it'll work for now as you want. But if in future you wanted to stop
using the caching pool, or switch it out for a different pool
entirely, that wouldn't work (whereas it would if the fs was using
"ecdata").

We should perhaps look at preventing use of cache pools like this...hrm...
http://tracker.ceph.com/issues/9435
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-11 Thread Samuel Just
Just added it to wip-sam-testing.
-Sam

On Thu, Sep 11, 2014 at 11:30 AM, Somnath Roy  wrote:
> Sam/Sage,
> I have addressed all of your comments and pushed the changes to the same pull 
> request.
>
> https://github.com/ceph/ceph/pull/2440
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Wednesday, September 10, 2014 8:33 PM
> To: Somnath Roy
> Cc: Samuel Just; ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> Subject: RE: OpTracker optimization
>
> I had two substantive comments on the first patch and then some trivial
> whitespace nits.  Otherwise looks good!
>
> thanks-
> sage
>
> On Thu, 11 Sep 2014, Somnath Roy wrote:
>
>> Sam/Sage,
>> I have incorporated all of your comments. Please have a look at the same 
>> pull request.
>>
>> https://github.com/ceph/ceph/pull/2440
>>
>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: Samuel Just [mailto:sam.j...@inktank.com]
>> Sent: Wednesday, September 10, 2014 3:25 PM
>> To: Somnath Roy
>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org;
>> ceph-users@lists.ceph.com
>> Subject: Re: OpTracker optimization
>>
>> Oh, I changed my mind, your approach is fine.  I was unclear.
>> Currently, I just need you to address the other comments.
>> -Sam
>>
>> On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy  wrote:
>> > As I understand, you want me to implement the following.
>> >
>> > 1.  Keep this implementation one sharded optracker for the ios going 
>> > through ms_dispatch path.
>> >
>> > 2. Additionally, for ios going through ms_fast_dispatch, you want me
>> > to implement optracker (without internal shard) per opwq shard
>> >
>> > Am I right ?
>> >
>> > Thanks & Regards
>> > Somnath
>> >
>> > -Original Message-
>> > From: Samuel Just [mailto:sam.j...@inktank.com]
>> > Sent: Wednesday, September 10, 2014 3:08 PM
>> > To: Somnath Roy
>> > Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org;
>> > ceph-users@lists.ceph.com
>> > Subject: Re: OpTracker optimization
>> >
>> > I don't quite understand.
>> > -Sam
>> >
>> > On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  
>> > wrote:
>> >> Thanks Sam.
>> >> So, you want me to go with optracker/shadedopWq , right ?
>> >>
>> >> Regards
>> >> Somnath
>> >>
>> >> -Original Message-
>> >> From: Samuel Just [mailto:sam.j...@inktank.com]
>> >> Sent: Wednesday, September 10, 2014 2:36 PM
>> >> To: Somnath Roy
>> >> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org;
>> >> ceph-users@lists.ceph.com
>> >> Subject: Re: OpTracker optimization
>> >>
>> >> Responded with cosmetic nonsense.  Once you've got that and the other 
>> >> comments addressed, I can put it in wip-sam-testing.
>> >> -Sam
>> >>
>> >> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  
>> >> wrote:
>> >>> Thanks Sam..I responded back :-)
>> >>>
>> >>> -Original Message-
>> >>> From: ceph-devel-ow...@vger.kernel.org
>> >>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
>> >>> Sent: Wednesday, September 10, 2014 11:17 AM
>> >>> To: Somnath Roy
>> >>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org;
>> >>> ceph-users@lists.ceph.com
>> >>> Subject: Re: OpTracker optimization
>> >>>
>> >>> Added a comment about the approach.
>> >>> -Sam
>> >>>
>> >>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  
>> >>> wrote:
>>  Hi Sam/Sage,
>> 
>  As we discussed earlier, enabling the present OpTracker code
>  degrades performance severely. For example, in my setup a single
>>  OSD node with
>>  10 clients is reaching ~103K read iops with io served from memory
>>  while optracking is disabled but enabling optracker it is reduced to 
>>  ~39K iops.
>>  Probably, running OSD without enabling OpTracker is not an option
>>  for many of Ceph users.
>> 
>>  Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist
>>  ops_in_flight) and removing some other bottlenecks I am able to
>>  match the performance of OpTracking enabled OSD with OpTracking
>>  disabled, but with the expense of ~1 extra cpu core.
>> 
>>  In this process I have also fixed the following tracker.
>> 
>> 
>> 
>>  http://tracker.ceph.com/issues/9384
>> 
>> 
>> 
>>  and probably http://tracker.ceph.com/issues/8885 too.
>> 
>> 
>> 
>>  I have created following pull request for the same. Please review it.
>> 
>> 
>> 
>>  https://github.com/ceph/ceph/pull/2440
>> 
>> 
>> 
>>  Thanks & Regards
>> 
>>  Somnath
>> 
>> 
>> 
>> 
>>  
>> 
>>  PLEASE NOTE: The information contained in this electronic mail
>>  message is intended only for the use of the designated
>>  recipient(s) named above. If the reader of this message is not
>>  the intended recipient, you are hereby notified that you have
>>  received this message in erro

Re: [ceph-users] Cephfs upon Tiering

2014-09-11 Thread Sage Weil
On Thu, 11 Sep 2014, Gregory Farnum wrote:
> On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman
>  wrote:
> > Hi all,
> >
> > I am testing the tiering functionality with cephfs. I used a replicated
> > cache with an EC data pool, and a replicated metadata pool like this:
> >
> >
> > ceph osd pool create cache 1024 1024
> > ceph osd pool set cache size 2
> > ceph osd pool set cache min_size 1
> > ceph osd erasure-code-profile set profile11 k=8 m=3
> > ruleset-failure-domain=osd
> > ceph osd pool create ecdata 128 128 erasure profile11
> > ceph osd tier add ecdata cache
> > ceph osd tier cache-mode cache writeback
> > ceph osd tier set-overlay ecdata cache
> > ceph osd pool set cache hit_set_type bloom
> > ceph osd pool set cache hit_set_count 1
> > ceph osd pool set cache hit_set_period 3600
> > ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
> > ceph osd pool create metadata 128 128
> > ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap
> > ceph fs new ceph_fs metadata cache  <-- wrong ?
> >
> > I started testing with this, and this worked, I could write to it with
> > cephfs and the cache was flushing to the ecdata pool as expected.
> > But now I notice I made the fs right upon the cache, instead of the
> > underlying data pool. I suppose I should have done this:
> >
> > ceph fs new ceph_fs metadata ecdata
> >
> > So my question is: Was this wrong and not doing the things I thought it did,
> > or was this somehow handled by ceph and didn't it matter I specified the
> > cache instead of the data pool?
> 
> Well, it's sort of doing what you want it to. You've told the
> filesystem to use the "cache" pool as the location for all of its
> data. But RADOS is pushing everything in the "cache" pool down to the
> "ecdata" pool.
> So it'll work for now as you want. But if in future you wanted to stop
> using the caching pool, or switch it out for a different pool
> entirely, that wouldn't work (whereas it would if the fs was using
> "ecdata").
> 
> We should perhaps look at prevent use of cache pools like this...hrm...
> http://tracker.ceph.com/issues/9435

Should we?  I was planning on doing exactly this for my home cluster.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-11 Thread Cedric Lemarchand

On 11/09/2014 19:33, Cedric Lemarchand wrote:
> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>> Hi Sebastien,
>>
>> here my first results with crucial m550 (I'll send result with intel s3500 
>> later):
>>
>> - 3 nodes
>> - dell r620 without expander backplane
>> - sas controller : lsi LSI 9207 (no hardware raid or cache)
>> - 2 x E5-2603v2 1.8GHz (4cores)
>> - 32GB ram
>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication.
>>
>> -os : debian wheezy, with kernel 3.10
>>
>> os + ceph mon : 2x intel s3500 100gb  linux soft raid
>> osd : crucial m550 (1TB).
>>
>>
>> 3mon in the ceph cluster,
>> and 1 osd (journal and datas on same disk)
>>
>>
>> ceph.conf 
>> -
>>   debug_lockdep = 0/0
>>   debug_context = 0/0
>>   debug_crush = 0/0
>>   debug_buffer = 0/0
>>   debug_timer = 0/0
>>   debug_filer = 0/0
>>   debug_objecter = 0/0
>>   debug_rados = 0/0
>>   debug_rbd = 0/0
>>   debug_journaler = 0/0
>>   debug_objectcatcher = 0/0
>>   debug_client = 0/0
>>   debug_osd = 0/0
>>   debug_optracker = 0/0
>>   debug_objclass = 0/0
>>   debug_filestore = 0/0
>>   debug_journal = 0/0
>>   debug_ms = 0/0
>>   debug_monc = 0/0
>>   debug_tp = 0/0
>>   debug_auth = 0/0
>>   debug_finisher = 0/0
>>   debug_heartbeatmap = 0/0
>>   debug_perfcounter = 0/0
>>   debug_asok = 0/0
>>   debug_throttle = 0/0
>>   debug_mon = 0/0
>>   debug_paxos = 0/0
>>   debug_rgw = 0/0
>>   osd_op_threads = 5
>>   filestore_op_threads = 4
>>
>>  ms_nocrc = true
>>  cephx sign messages = false
>>  cephx require signatures = false
>>
>>  ms_dispatch_throttle_bytes = 0
>>
>>  #0.85
>>  throttler_perf_counter = false
>>  filestore_fd_cache_size = 64
>>  filestore_fd_cache_shards = 32
>>  osd_op_num_threads_per_shard = 1
>>  osd_op_num_shards = 25
>>  osd_enable_op_tracker = true
>>
>>
>>
>> Fio disk 4K benchmark
>> --
>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k 
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>> bw=271755KB/s, iops=67938 
>>
>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k 
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>> bw=228293KB/s, iops=57073
>>
>>
>>
>> fio osd benchmark (through librbd)
>> --
>> [global]
>> ioengine=rbd
>> clientname=admin
>> pool=test
>> rbdname=test
>> invalidate=0# mandatory
>> rw=randwrite
>> rw=randread
>> bs=4k
>> direct=1
>> numjobs=4
>> group_reporting=1
>>
>> [rbd_iodepth32]
>> iodepth=32
>>
>>
>>
>> FIREFLY RESULTS
>> 
>> fio randwrite : bw=5009.6KB/s, iops=1252
>>
>> fio randread: bw=37820KB/s, iops=9455
>>
>>
>>
>> O.85 RESULTS
>> 
>>
>> fio randwrite : bw=11658KB/s, iops=2914
>>
>> fio randread : bw=38642KB/s, iops=9660
>>
>>
>>
>> 0.85 + osd_enable_op_tracker=false
>> ---
>> fio randwrite : bw=11630KB/s, iops=2907
>> fio randread : bw=80606KB/s, iops=20151,   (cpu 100% - GREAT !)
>>
>>
>>
>> So, for read, seem that osd_enable_op_tracker is the bottleneck.
>>
>>
>> Now for write, I really don't understand why it's so low.
>>
>>
>> I have done some iostat:
>>
>>
>> FIO directly on /dev/sdb
>> bw=228293KB/s, iops=57073
>>
>> Device:     rrqm/s   wrqm/s    r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb           0,00     0,00   0,00 63613,00     0,00 254452,00     8,00    31,24    0,49    0,00    0,49   0,02 100,00
>>
>>
>> FIO directly on osd through librbd
>> bw=11658KB/s, iops=2914
>>
>> Device:     rrqm/s   wrqm/s    r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb           0,00   355,00   0,00  5225,00     0,00  29678,00    11,36    57,63   11,03    0,00   11,03   0,19  99,70
>>
>>
>> (I don't understand what exactly %util means; it's 100% in both cases, even 
>> though it's 10x slower with ceph)
> It would be interesting if you could catch the size of writes on SSD
> during the bench through librbd (I know nmon can do that)
Replying to myself ... I asked a bit quickly, since we already have
this information (29678 / 5225 = 5.68 KB), but this is irrelevant.

Cheers

>> It could be a dsync problem, result seem pretty poor
>>
>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
>> 65536+0 records in
>> 65536+0 records out
>> 268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
>>
>>
>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
>> ^C17228+0 records in
>> 17228+0 records out
>> 70565888 bytes (71 MB) copied, 70.4098 s, 1,

Re: [ceph-users] Cephfs upon Tiering

2014-09-11 Thread Gregory Farnum
On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil  wrote:
> On Thu, 11 Sep 2014, Gregory Farnum wrote:
>> On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman
>>  wrote:
>> > Hi all,
>> >
>> > I am testing the tiering functionality with cephfs. I used a replicated
>> > cache with an EC data pool, and a replicated metadata pool like this:
>> >
>> >
>> > ceph osd pool create cache 1024 1024
>> > ceph osd pool set cache size 2
>> > ceph osd pool set cache min_size 1
>> > ceph osd erasure-code-profile set profile11 k=8 m=3
>> > ruleset-failure-domain=osd
>> > ceph osd pool create ecdata 128 128 erasure profile11
>> > ceph osd tier add ecdata cache
>> > ceph osd tier cache-mode cache writeback
>> > ceph osd tier set-overlay ecdata cache
>> > ceph osd pool set cache hit_set_type bloom
>> > ceph osd pool set cache hit_set_count 1
>> > ceph osd pool set cache hit_set_period 3600
>> > ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
>> > ceph osd pool create metadata 128 128
>> > ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap
>> > ceph fs new ceph_fs metadata cache  <-- wrong ?
>> >
>> > I started testing with this, and this worked, I could write to it with
>> > cephfs and the cache was flushing to the ecdata pool as expected.
>> > But now I notice I made the fs right upon the cache, instead of the
>> > underlying data pool. I suppose I should have done this:
>> >
>> > ceph fs new ceph_fs metadata ecdata
>> >
>> > So my question is: Was this wrong and not doing the things I thought it 
>> > did,
>> > or was this somehow handled by ceph and didn't it matter I specified the
>> > cache instead of the data pool?
>>
>> Well, it's sort of doing what you want it to. You've told the
>> filesystem to use the "cache" pool as the location for all of its
>> data. But RADOS is pushing everything in the "cache" pool down to the
>> "ecdata" pool.
>> So it'll work for now as you want. But if in future you wanted to stop
>> using the caching pool, or switch it out for a different pool
>> entirely, that wouldn't work (whereas it would if the fs was using
>> "ecdata").
>>
>> We should perhaps look at prevent use of cache pools like this...hrm...
>> http://tracker.ceph.com/issues/9435
>
> Should we?  I was planning on doing exactly this for my home cluster.

Not cache pools under CephFS, but specifying the cache pool as the
data pool (rather than some underlying pool). Or is there some reason
we might want the cache pool to be the one the filesystem is using for
indexing?
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgraded now MDS won't start

2014-09-11 Thread Gregory Farnum
On Wed, Sep 10, 2014 at 4:24 PM, McNamara, Bradley
 wrote:
> Hello,
>
> This is my first real issue since running Ceph for several months.  Here's 
> the situation:
>
> I've been running an Emperor cluster for several months.  All was good.  I 
> decided to upgrade since I'm running Ubuntu 13.10 and 0.72.2.  I decided to 
> first upgrade Ceph to 0.80.4, which was the last version in the apt 
> repository for 13.10.  I upgrade the MON's, then the OSD servers to 0.80.4; 
> all went as expected with no issues.  The last thing I did was upgrade the 
> MDS using the same process, but now the MDS won't start.  I've tried to 
> manually start the MDS with debugging on, and I have attached the file.  It 
> complains that it's looking for "mds.0.20  need osdmap epoch 3602, have 3601".
>
> Anyway, I don't really use CephFS or RGW, so I don't need the MDS, but I'd 
> like to have it.  Can someone tell me how to fix it, or delete it, so I can 
> start over when I do need it?  Right now my cluster is HEALTH_WARN because of 
> it.

Uh, the log is from an MDS running Emperor. That one looks like it's
complaining because the mds data formats got updated for Firefly. ;)
You'll need to run debugging from a Firefly mds to try and get
something useful.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs upon Tiering

2014-09-11 Thread Sage Weil
On Thu, 11 Sep 2014, Gregory Farnum wrote:
> On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil  wrote:
> > On Thu, 11 Sep 2014, Gregory Farnum wrote:
> >> On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman
> >>  wrote:
> >> > Hi all,
> >> >
> >> > I am testing the tiering functionality with cephfs. I used a replicated
> >> > cache with an EC data pool, and a replicated metadata pool like this:
> >> >
> >> >
> >> > ceph osd pool create cache 1024 1024
> >> > ceph osd pool set cache size 2
> >> > ceph osd pool set cache min_size 1
> >> > ceph osd erasure-code-profile set profile11 k=8 m=3
> >> > ruleset-failure-domain=osd
> >> > ceph osd pool create ecdata 128 128 erasure profile11
> >> > ceph osd tier add ecdata cache
> >> > ceph osd tier cache-mode cache writeback
> >> > ceph osd tier set-overlay ecdata cache
> >> > ceph osd pool set cache hit_set_type bloom
> >> > ceph osd pool set cache hit_set_count 1
> >> > ceph osd pool set cache hit_set_period 3600
> >> > ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
> >> > ceph osd pool create metadata 128 128
> >> > ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap
> >> > ceph fs new ceph_fs metadata cache  <-- wrong ?
> >> >
> >> > I started testing with this, and this worked, I could write to it with
> >> > cephfs and the cache was flushing to the ecdata pool as expected.
> >> > But now I notice I made the fs right upon the cache, instead of the
> >> > underlying data pool. I suppose I should have done this:
> >> >
> >> > ceph fs new ceph_fs metadata ecdata
> >> >
> >> > So my question is: Was this wrong and not doing the things I thought it 
> >> > did,
> >> > or was this somehow handled by ceph and didn't it matter I specified the
> >> > cache instead of the data pool?
> >>
> >> Well, it's sort of doing what you want it to. You've told the
> >> filesystem to use the "cache" pool as the location for all of its
> >> data. But RADOS is pushing everything in the "cache" pool down to the
> >> "ecdata" pool.
> >> So it'll work for now as you want. But if in future you wanted to stop
> >> using the caching pool, or switch it out for a different pool
> >> entirely, that wouldn't work (whereas it would if the fs was using
> >> "ecdata").
> >>
> >> We should perhaps look at prevent use of cache pools like this...hrm...
> >> http://tracker.ceph.com/issues/9435
> >
> > Should we?  I was planning on doing exactly this for my home cluster.
> 
> Not cache pools under CephFS, but specifying the cache pool as the
> data pool (rather than some underlying pool). Or is there some reason
> we might want the cache pool to be the one the filesystem is using for
> indexing?

Oh, right.  Yeah that's fine.  :)

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgraded now MDS won't start

2014-09-11 Thread McNamara, Bradley
That portion of the log confused me, too.  However, I had run the same upgrade 
process on the MDS as all the other cluster components.  Firefly was actually 
installed on the MDS even though the log mentions 0.72.2.

At any rate, I ended up stopping the MDS and using 'newfs' on the metadata and 
data pools to eliminate the HEALTH_WARN issue.
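For readers who want the same way out: a hedged sketch of the 'newfs' step
mentioned above (the pool ids are hypothetical, and this discards the existing
CephFS metadata, so only do it when the filesystem contents are disposable):

    ceph osd lspools                 # note the numeric ids of the pools
    ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it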

-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com] 
Sent: Thursday, September 11, 2014 2:09 PM
To: McNamara, Bradley
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Upgraded now MDS won't start

On Wed, Sep 10, 2014 at 4:24 PM, McNamara, Bradley 
 wrote:
> Hello,
>
> This is my first real issue since running Ceph for several months.  Here's 
> the situation:
>
> I've been running an Emperor cluster for several months.  All was good.  I 
> decided to upgrade since I'm running Ubuntu 13.10 and 0.72.2.  I decided to 
> first upgrade Ceph to 0.80.4, which was the last version in the apt 
> repository for 13.10.  I upgrade the MON's, then the OSD servers to 0.80.4; 
> all went as expected with no issues.  The last thing I did was upgrade the 
> MDS using the same process, but now the MDS won't start.  I've tried to 
> manually start the MDS with debugging on, and I have attached the file.  It 
> complains that it's looking for "mds.0.20  need osdmap epoch 3602, have 3601".
>
> Anyway, I don't really use CephFS or RGW, so I don't need the MDS, but I'd 
> like to have it.  Can someone tell me how to fix it, or delete it, so I can 
> start over when I do need it?  Right now my cluster is HEALTH_WARN because of 
> it.

Uh, the log is from an MDS running Emperor. That one looks like it's 
complaining because the mds data formats got updated for Firefly. ;) You'll 
need to run debugging from a Firefly mds to try and get something useful.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Consistent hashing

2014-09-11 Thread Jakes John
Hi,
I would like to know a few points regarding the consistent hashing in the
CRUSH algorithm.  When I read the algorithm, I noticed that if a selected
bucket (device) has failed or is overloaded, it is skipped and a new bucket is
selected. The same happens when a collision occurs. If such an event happens, how
is it ensured that each run of the CRUSH algorithm gives the same set of osds?

In other words, how does the CRUSH algorithm produce the same output in both of
these scenarios: 1) when there are devices in the cluster that are
failed/overloaded, or collisions occur between selections, and 2) when all devices
in the cluster are ready to be mapped?

In scenario 1, the cluster map remains the same, but the number of devices that
are ready to be mapped is smaller. In scenario 2, the failed or overloaded devices
have been restored.

It would be helpful if someone could point out how this is handled.

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Regarding key/value interface

2014-09-11 Thread Somnath Roy
Hi Sage/Haomai,
If I have a key/value backend that supports transactions and range queries (and I 
don't need any explicit caching etc.) and I want to replace filestore (and the 
leveldb omap) with it, which interface would you recommend I derive from: 
ObjectStore directly, or KeyValueDB?
I have already integrated this backend by deriving from the ObjectStore interfaces 
earlier (pre key/value-interface days) but have not tested it thoroughly enough to 
see what functionality is broken (basic RGW/RBD functionality is working fine).
Basically, I want to know the advantages (and disadvantages) of deriving it from 
the new key/value interface.
Also, what state is it in? Is it feature complete, supporting all the ObjectStore 
interfaces such as clone?

Thanks & Regards
Somnath



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding key/value interface

2014-09-11 Thread Sage Weil
Hi Somnath,

On Fri, 12 Sep 2014, Somnath Roy wrote:
> 
> Hi Sage/Haomai,
> 
> If I have a key/value backend that support transaction, range queries (and I
> don?t need any explicit caching etc.) and I want to replace filestore (and
> leveldb omap) with that,  which interface you recommend me to derive from ,
> directly ObjectStore or  KeyValueDB ?
> 
> I have already integrated this backend by deriving from ObjectStore
> interfaces earlier (pre keyvalueinteface days) but not tested thoroughly
> enough to see what functionality is broken (Basic functionalities of RGW/RBD
> are working fine).
> 
> Basically, I want to know what are the advantages (and disadvantages) of
> deriving it from the new key/value interfaces ?
> 
> Also, what state is it in ? Is it feature complete and supporting all the
> ObjectStore interfaces like clone and all ?

Everything is supported, I think, except perhaps some IO hints that don't 
make sense in a k/v context.  The big things that you get by using 
KeyValueStore and plugging into the lower-level interface are:

 - striping of file data across keys
 - efficient clone
 - a zillion smaller methods that aren't conceptually difficult but are 
tedious to implement.

The other nice thing about reusing this code is that you can use a leveldb 
or rocksdb backend as a reference for testing or performance or whatever.

The main thing that will be a challenge going forward, I predict, is 
making storage of the object byte payload in key/value pairs efficient.  I 
think KeyValuestore is doing some simple striping, but it will suffer for 
small overwrites (like 512-byte or 4k writes from an RBD).  There are 
probably some pretty simple heuristics and tricks that can be done to 
mitigate the most common patterns, but there is no simple solution since 
the backends generally don't support partial value updates (I assume yours 
doesn't either?).  But, any work done here will benefit the other backends 
too so that would be a win..
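For experimentation, the existing KeyValueStore backend of this era is selected in
ceph.conf roughly as sketched below; the option names were still marked
experimental and may differ between 0.80 and 0.85, so treat them as assumptions to
verify against your build:

    osd objectstore = keyvaluestore-dev
    keyvaluestore backend = leveldb     # or rocksdb, where it is built in
    # the striping described above is governed by a strip-size option
    # (keyvaluestore_default_strip_size or similar in this era)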

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding key/value interface

2014-09-11 Thread Somnath Roy
Makes perfect sense, Sage.

Regarding striping of file data, you are saying the KeyValue interface will do 
the following for me?

1. Say, in the case of an rbd image of order 4 MB, when a write request comes to 
the Key/Value interface, it will chunk the object (say the full 4MB) into smaller 
sizes (configurable?) and stripe it as multiple key/value pairs?

2. Also, while reading, it will take care of reassembling the data and sending it back.


Thanks & Regards
Somnath


-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Thursday, September 11, 2014 6:31 PM
To: Somnath Roy
Cc: Haomai Wang (haomaiw...@gmail.com); ceph-users@lists.ceph.com; 
ceph-de...@vger.kernel.org
Subject: Re: Regarding key/value interface

Hi Somnath,

On Fri, 12 Sep 2014, Somnath Roy wrote:
>
> Hi Sage/Haomai,
>
> If I have a key/value backend that support transaction, range queries
> (and I don?t need any explicit caching etc.) and I want to replace
> filestore (and leveldb omap) with that,  which interface you recommend
> me to derive from , directly ObjectStore or  KeyValueDB ?
>
> I have already integrated this backend by deriving from ObjectStore
> interfaces earlier (pre keyvalueinteface days) but not tested
> thoroughly enough to see what functionality is broken (Basic
> functionalities of RGW/RBD are working fine).
>
> Basically, I want to know what are the advantages (and disadvantages)
> of deriving it from the new key/value interfaces ?
>
> Also, what state is it in ? Is it feature complete and supporting all
> the ObjectStore interfaces like clone and all ?

Everything is supported, I think, for perhaps some IO hints that don't make 
sense in a k/v context.  The big things that you get by using KeyValueStore and 
plugging into the lower-level interface are:

 - striping of file data across keys
 - efficient clone
 - a zillion smaller methods that aren't conceptually difficult to implement 
bug tedious and to do so.

The other nice thing about reusing this code is that you can use a leveldb or 
rocksdb backend as a reference for testing or performance or whatever.

The main thing that will be a challenge going forward, I predict, is making 
storage of the object byte payload in key/value pairs efficient.  I think 
KeyValuestore is doing some simple striping, but it will suffer for small 
overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
pretty simple heuristics and tricks that can be done to mitigate the most 
common patterns, but there is no simple solution since the backends generally 
don't support partial value updates (I assume yours doesn't either?).  But, any 
work done here will benefit the other backends too so that would be a win..

sage



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph object back up details

2014-09-11 Thread M Ranga Swami Reddy
Thanks for details.
Thanks
Swami
On Sep 8, 2014 9:21 PM, "Yehuda Sadeh"  wrote:

> Not sure I understand what you ask. Multiple zones within the same
> region configuration is described here:
>
>
> http://ceph.com/docs/master/radosgw/federated-config/#multi-site-data-replication
>
> Yehuda
>
> On Sun, Sep 7, 2014 at 10:32 PM, M Ranga Swami Reddy
>  wrote:
> > Hi Yahuda,
> > I need more info on Ceph object backup mechanism.. Could  please share
> > a related doc or link for this?
> > Thanks
> > Swami
> >
> > On Thu, Sep 4, 2014 at 10:58 PM, M Ranga Swami Reddy
> >  wrote:
> >> Hi,
> >> I need more info on Ceph object backup mechanism.. Could someone share a
> >> related doc or link for this?
> >> Thanks
> >> Swami
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding key/value interface

2014-09-11 Thread Sage Weil
On Fri, 12 Sep 2014, Somnath Roy wrote:
> Make perfect sense Sage..
> 
> Regarding striping of filedata, You are saying KeyValue interface will do the 
> following for me?
> 
> 1. Say in case of rbd image of order 4 MB, a write request coming to 
> Key/Value interface, it will  chunk the object (say full 4MB) in smaller 
> sizes (configurable ?) and stripe it as multiple key/value pair ?
> 
> 2. Also, while reading it will take care of accumulating and send it back.

Precisely.

A smarter thing we might want to make it do in the future would be to take 
a 4 KB write, create a new key that logically overwrites part of the 
larger (say, 1MB) key, and apply it on read.  And maybe give up and rewrite 
the entire 1MB stripe after too many small overwrites have accumulated.  
Something along those lines to reduce the cost of small IOs to large 
objects.

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-users@lists.ceph.com; 
> ceph-de...@vger.kernel.org
> Subject: Re: Regarding key/value interface
> 
> Hi Somnath,
> 
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that support transaction, range queries
> > (and I don?t need any explicit caching etc.) and I want to replace
> > filestore (and leveldb omap) with that,  which interface you recommend
> > me to derive from , directly ObjectStore or  KeyValueDB ?
> >
> > I have already integrated this backend by deriving from ObjectStore
> > interfaces earlier (pre keyvalueinteface days) but not tested
> > thoroughly enough to see what functionality is broken (Basic
> > functionalities of RGW/RBD are working fine).
> >
> > Basically, I want to know what are the advantages (and disadvantages)
> > of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting all
> > the ObjectStore interfaces like clone and all ?
> 
> Everything is supported, I think, for perhaps some IO hints that don't make 
> sense in a k/v context.  The big things that you get by using KeyValueStore 
> and plugging into the lower-level interface are:
> 
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to implement 
> bug tedious and to do so.
> 
> The other nice thing about reusing this code is that you can use a leveldb or 
> rocksdb backend as a reference for testing or performance or whatever.
> 
> The main thing that will be a challenge going forward, I predict, is making 
> storage of the object byte payload in key/value pairs efficient.  I think 
> KeyValuestore is doing some simple striping, but it will suffer for small 
> overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
> pretty simple heuristics and tricks that can be done to mitigate the most 
> common patterns, but there is no simple solution since the backends generally 
> don't support partial value updates (I assume yours doesn't either?).  But, 
> any work done here will benefit the other backends too so that would be a 
> win..
> 
> sage
> 
> 
> 
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this message is not the intended recipient, you are hereby notified 
> that you have received this message in error and that any review, 
> dissemination, distribution, or copying of this message is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender by telephone or e-mail (as shown above) immediately and destroy 
> any and all copies of this message in your possession (whether hard copies or 
> electronically stored copies).
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding key/value interface

2014-09-11 Thread Haomai Wang
On Fri, Sep 12, 2014 at 9:46 AM, Somnath Roy 
wrote:

> Make perfect sense Sage..
>
> Regarding striping of filedata, You are saying KeyValue interface will do
> the following for me?
>
> 1. Say in case of rbd image of order 4 MB, a write request coming to
> Key/Value interface, it will  chunk the object (say full 4MB) in smaller
> sizes (configurable ?) and stripe it as multiple key/value pair ?
>

Yes, and the stripe size can be configured.


>
> 2. Also, while reading it will take care of accumulating and send it back.
>

Do you have any other ideas? By the way, could you tell us more about your
key/value interface? I'm doing some work on an NVMe interface with Intel NVMe
SSDs.


>
>
> Thanks & Regards
> Somnath
>
>
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-users@lists.ceph.com;
> ceph-de...@vger.kernel.org
> Subject: Re: Regarding key/value interface
>
> Hi Somnath,
>
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that support transaction, range queries
> > (and I don?t need any explicit caching etc.) and I want to replace
> > filestore (and leveldb omap) with that,  which interface you recommend
> > me to derive from , directly ObjectStore or  KeyValueDB ?
> >
> > I have already integrated this backend by deriving from ObjectStore
> > interfaces earlier (pre keyvalueinteface days) but not tested
> > thoroughly enough to see what functionality is broken (Basic
> > functionalities of RGW/RBD are working fine).
> >
> > Basically, I want to know what are the advantages (and disadvantages)
> > of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting all
> > the ObjectStore interfaces like clone and all ?
>
> Everything is supported, I think, except perhaps some IO hints that don't
> make sense in a k/v context.  The big things that you get by using
> KeyValueStore and plugging into the lower-level interface are:
>
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to
> implement but are tedious to do.
>
> The other nice thing about reusing this code is that you can use a leveldb
> or rocksdb backend as a reference for testing or performance or whatever.
>
> The main thing that will be a challenge going forward, I predict, is
> making storage of the object byte payload in key/value pairs efficient.  I
> think KeyValueStore is doing some simple striping, but it will suffer for
> small overwrites (like 512-byte or 4k writes from an RBD).  There are
> probably some pretty simple heuristics and tricks that can be done to
> mitigate the most common patterns, but there is no simple solution since
> the backends generally don't support partial value updates (I assume yours
> doesn't either?).  But, any work done here will benefit the other backends
> too so that would be a win..
>
> sage
>
> 
>
>
>


-- 

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-11 Thread Alexandre DERUMIER
Hi,
it seems that the Intel S3500 performs a lot better with O_DSYNC.

crucial m550

#fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1
bw=1249.9KB/s, iops=312

intel s3500
---
fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1
#bw=41794KB/s, iops=10448

ok, so 30x faster.
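
(As a rough illustration of what --sync=1 is exercising: each 4 KB write has to
reach stable storage before write() returns. The little Python sketch below does
the same thing with O_DSYNC against a plain file, so it is only an illustration
of the access pattern, not a replacement for fio or a raw-device test.)

# sketch: synchronous 4 KB writes with O_DSYNC (illustrative only)
import os, time

BS, COUNT = 4096, 2000
buf = b"\0" * BS
fd = os.open("dsync-test.bin", os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
start = time.time()
for _ in range(COUNT):
    os.write(fd, buf)
os.close(fd)
elapsed = time.time() - start
print("%.0f sync-write iops, %.0f KB/s" % (COUNT / elapsed, COUNT * BS / elapsed / 1024))

The per-write sync is what the OSD journal exercises, which is why the two drives
diverge so much here even though their raw bandwidth is similar.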



For the Crucial, I'll try to apply the patch from Stefan Priebe to ignore flushes
(as the Crucial M550 has supercaps):
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html




- Original Message -

From: "Cedric Lemarchand"
To: ceph-users@lists.ceph.com
Sent: Thursday, 11 September 2014 21:23:23
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K
IOPS


On 11/09/2014 19:33, Cedric Lemarchand wrote:
> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>> Hi Sebastien, 
>> 
>> here my first results with crucial m550 (I'll send result with intel s3500 
>> later): 
>> 
>> - 3 nodes 
>> - dell r620 without expander backplane 
>> - sas controller : lsi LSI 9207 (no hardware raid or cache) 
>> - 2 x E5-2603v2 1.8GHz (4cores) 
>> - 32GB ram 
>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication. 
>> 
>> -os : debian wheezy, with kernel 3.10 
>> 
>> os + ceph mon : 2x intel s3500 100gb linux soft raid 
>> osd : crucial m550 (1TB). 
>> 
>> 
>> 3mon in the ceph cluster, 
>> and 1 osd (journal and datas on same disk) 
>> 
>> 
>> ceph.conf 
>> - 
>> debug_lockdep = 0/0 
>> debug_context = 0/0 
>> debug_crush = 0/0 
>> debug_buffer = 0/0 
>> debug_timer = 0/0 
>> debug_filer = 0/0 
>> debug_objecter = 0/0 
>> debug_rados = 0/0 
>> debug_rbd = 0/0 
>> debug_journaler = 0/0 
>> debug_objectcatcher = 0/0 
>> debug_client = 0/0 
>> debug_osd = 0/0 
>> debug_optracker = 0/0 
>> debug_objclass = 0/0 
>> debug_filestore = 0/0 
>> debug_journal = 0/0 
>> debug_ms = 0/0 
>> debug_monc = 0/0 
>> debug_tp = 0/0 
>> debug_auth = 0/0 
>> debug_finisher = 0/0 
>> debug_heartbeatmap = 0/0 
>> debug_perfcounter = 0/0 
>> debug_asok = 0/0 
>> debug_throttle = 0/0 
>> debug_mon = 0/0 
>> debug_paxos = 0/0 
>> debug_rgw = 0/0 
>> osd_op_threads = 5 
>> filestore_op_threads = 4 
>> 
>> ms_nocrc = true 
>> cephx sign messages = false 
>> cephx require signatures = false 
>> 
>> ms_dispatch_throttle_bytes = 0 
>> 
>> #0.85 
>> throttler_perf_counter = false 
>> filestore_fd_cache_size = 64 
>> filestore_fd_cache_shards = 32 
>> osd_op_num_threads_per_shard = 1 
>> osd_op_num_shards = 25 
>> osd_enable_op_tracker = true 
>> 
>> 
>> 
>> Fio disk 4K benchmark 
>> -- 
>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k 
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio 
>> bw=271755KB/s, iops=67938 
>> 
>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k 
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio 
>> bw=228293KB/s, iops=57073 
>> 
>> 
>> 
>> fio osd benchmark (through librbd) 
>> -- 
>> [global] 
>> ioengine=rbd 
>> clientname=admin 
>> pool=test 
>> rbdname=test 
>> invalidate=0 # mandatory 
>> rw=randwrite 
>> rw=randread 
>> bs=4k 
>> direct=1 
>> numjobs=4 
>> group_reporting=1 
>> 
>> [rbd_iodepth32] 
>> iodepth=32 
>> 
>> 
>> 
>> FIREFLY RESULTS 
>>  
>> fio randwrite : bw=5009.6KB/s, iops=1252 
>> 
>> fio randread: bw=37820KB/s, iops=9455 
>> 
>> 
>> 
>> O.85 RESULTS 
>>  
>> 
>> fio randwrite : bw=11658KB/s, iops=2914 
>> 
>> fio randread : bw=38642KB/s, iops=9660 
>> 
>> 
>> 
>> 0.85 + osd_enable_op_tracker=false 
>> --- 
>> fio randwrite : bw=11630KB/s, iops=2907 
>> fio randread : bw=80606KB/s, iops=20151, (cpu 100% - GREAT !) 
>> 
>> 
>> 
>> So, for read, seem that osd_enable_op_tracker is the bottleneck. 
>> 
>> 
>> Now for write, I really don't understand why it's so low. 
>> 
>> 
>> I have done some iostat: 
>> 
>> 
>> FIO directly on /dev/sdb 
>> bw=228293KB/s, iops=57073 
>> 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
>> w_await svctm %util 
>> sdb 0,00 0,00 0,00 63613,00 0,00 254452,00 8,00 31,24 0,49 0,00 0,49 0,02 
>> 100,00 
>> 
>> 
>> FIO directly on osd through librbd 
>> bw=11658KB/s, iops=2914 
>> 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
>> w_await svctm %util 
>> sdb 0,00 355,00 0,00 5225,00 0,00 29678,00 11,36 57,63 11,03 0,00 11,03 0,19 
>> 99,70 
>> 
>> 
>> (I don't understand what exactly is %util, 100% in the 2 cases, because 10x 
>> slower with ceph) 
> It would be interesting if you could catch the size of writes on SSD 
> during the bench through librbd (I know nmon can do that) 
Replying to myself ... I asked a bit quickly, since we already have
this information (29678 / 5225 = 5.68 KB per write), but this is irrelevant.

Cheers 

>> It c

Re: [ceph-users] Regarding key/value interface

2014-09-11 Thread Somnath Roy
Thanks Sage...
Basically, we are doing similar chunking in our current implementation, which is
derived from ObjectStore.
Moving to key/value will save us from that :-)
Also, I was thinking we may want to do compression (and later maybe dedupe?) at
that key/value layer as well.

Yes, partial read/write is definitely a performance killer for object stores, and
our object store is no exception. We need to see how we can counter that.

But I think these are reason enough for me to move our implementation to
the key/value interfaces now.
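
For what it's worth, the compression idea could be as simple as the sketch below:
a thin wrapper that compresses values on put and decompresses on get. The wrapped
store and its put/get methods are assumptions for illustration only, not any
existing Ceph or backend API.

# sketch: transparent value compression at the key/value layer (illustrative)
import zlib

class CompressingKV:
    def __init__(self, store, min_size=512):
        self.store = store          # anything with put(key, bytes) / get(key) -> bytes
        self.min_size = min_size    # skip compression for tiny values

    def put(self, key, value):
        if len(value) >= self.min_size:
            blob = b"Z" + zlib.compress(value)   # 1-byte marker: compressed
        else:
            blob = b"R" + value                  # raw
        self.store.put(key, blob)

    def get(self, key):
        blob = self.store.get(key)
        if blob is None:
            return None
        return zlib.decompress(blob[1:]) if blob[:1] == b"Z" else blob[1:]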

Regards
Somnath


-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Thursday, September 11, 2014 6:55 PM
To: Somnath Roy
Cc: Haomai Wang (haomaiw...@gmail.com); ceph-users@lists.ceph.com; 
ceph-de...@vger.kernel.org
Subject: RE: Regarding key/value interface

On Fri, 12 Sep 2014, Somnath Roy wrote:
> Makes perfect sense, Sage.
> 
> Regarding striping of file data, you are saying the KeyValue interface will do the
> following for me?
> 
> 1. Say, in the case of an rbd image of order 4 MB, for a write request coming to the
> Key/Value interface, it will chunk the object (say the full 4 MB) into smaller
> sizes (configurable?) and stripe it as multiple key/value pairs?
> 
> 2. Also, while reading, it will take care of accumulating the pieces and sending them back.

Precisely.

A smarter thing we might want to make it do in the future would be to take a 4
KB write and create a new key that logically overwrites part of the larger, say,
1 MB key, and apply it on read.  And maybe give up and rewrite the entire 1 MB
stripe after too many small overwrites have accumulated.
Something along those lines to reduce the cost of small IOs to large objects.
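
Very roughly, the idea could look like the toy sketch below (Python, purely
illustrative: the stripe size, key naming, and overlay threshold are made-up
assumptions, and this is not what KeyValueStore actually implements):

STRIPE = 1 << 20            # assumed 1 MB stripe size
MAX_OVERLAYS = 16           # assumed threshold before rewriting the whole stripe

class ToyStripedKVStore:
    def __init__(self):
        self.kv = {}            # stripe key -> bytes (the backing key/value store)
        self.overlays = {}      # stripe key -> list of (offset, data) small overwrites

    def _key(self, oid, index):
        return "%s.%08x" % (oid, index)

    def write(self, oid, offset, data):
        key = self._key(oid, offset // STRIPE)
        off = offset % STRIPE
        # toy model: a write never spans two stripes
        assert off + len(data) <= STRIPE
        if len(data) == STRIPE:
            self.kv[key] = data                     # full-stripe write: just replace
            self.overlays.pop(key, None)
            return
        self.overlays.setdefault(key, []).append((off, data))
        if len(self.overlays[key]) >= MAX_OVERLAYS:
            self.kv[key] = self._materialize(key)   # give up: rewrite the stripe
            del self.overlays[key]

    def _materialize(self, key):
        base = bytearray(self.kv.get(key, b"\0" * STRIPE))
        for off, data in self.overlays.get(key, []):
            base[off:off + len(data)] = data        # apply overlays on read
        return bytes(base)

    def read(self, oid, offset, length):
        # toy model: a read stays within one stripe
        key = self._key(oid, offset // STRIPE)
        off = offset % STRIPE
        return self._materialize(key)[off:off + length]

The trade-off is when to fold overlays back into the stripe: too eager and every
small IO pays a full-stripe rewrite, too lazy and reads have to apply a long
overlay chain.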

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-users@lists.ceph.com; 
> ceph-de...@vger.kernel.org
> Subject: Re: Regarding key/value interface
> 
> Hi Somnath,
> 
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that supports transactions and range 
> > queries (and I don't need any explicit caching etc.) and I want to 
> > replace filestore (and leveldb omap) with it, which interface do you 
> > recommend I derive from: ObjectStore directly, or KeyValueDB?
> >
> > I have already integrated this backend by deriving from ObjectStore 
> > interfaces earlier (pre key/value-interface days) but not tested 
> > thoroughly enough to see what functionality is broken (Basic 
> > functionalities of RGW/RBD are working fine).
> >
> > Basically, I want to know what are the advantages (and 
> > disadvantages) of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting 
> > all the ObjectStore interfaces like clone and all ?
> 
> Everything is supported, I think, except perhaps some IO hints that don't make 
> sense in a k/v context.  The big things that you get by using KeyValueStore 
> and plugging into the lower-level interface are:
> 
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to implement 
> bug tedious and to do so.
> 
> The other nice thing about reusing this code is that you can use a leveldb or 
> rocksdb backend as a reference for testing or performance or whatever.
> 
> The main thing that will be a challenge going forward, I predict, is making 
> storage of the object byte payload in key/value pairs efficient.  I think 
> KeyValueStore is doing some simple striping, but it will suffer for small 
> overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
> pretty simple heuristics and tricks that can be done to mitigate the most 
> common patterns, but there is no simple solution since the backends generally 
> don't support partial value updates (I assume yours doesn't either?).  But, 
> any work done here will benefit the other backends too so that would be a 
> win..
> 
> sage
> 
> 
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding key/value interface

2014-09-11 Thread Sage Weil
On Fri, 12 Sep 2014, Somnath Roy wrote:
> Thanks Sage...
> Basically, we are doing similar chunking in our current implementation which 
> is derived from objectstore. 
> Moving to Key/value will save us from that :-)
> Also, I was thinking, we may want to do compression (later may be dedupe ?) 
> on that Key/value layer as well.
> 
> Yes, partial read/write is definitely performance killer for object stores 
> and our objectstore is no exception. We need to see how we can counter that.
> 
> But, I think these are enough reason for me now to move our implementation to 
> the key/value interfaces. 

Sounds good.

By the way, hopefully this is a pretty painless process of wrapping your 
kv library with the KeyValueDB interface.  If not, that will be good to 
know.  I'm hoping it will fit well with a broad range of backends, but so 
far we've only done leveldb/rocksdb (same interface) and kinetic.  I'd 
like to see us try LMDB in this context as well...
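
In rough pseudocode terms, the wrapping amounts to something like the sketch
below (Python-flavoured and purely illustrative: the method names do not match
the real KeyValueDB C++ signatures, and the backend's batch/put/delete/get/scan
API is assumed):

class Transaction:
    def __init__(self):
        self.ops = []                                # ("set"|"rm", full_key, value)

    def set(self, prefix, key, value):
        self.ops.append(("set", prefix + "/" + key, value))

    def rmkey(self, prefix, key):
        self.ops.append(("rm", prefix + "/" + key, None))

class MyKVDBAdapter:
    """Wraps a backend assumed to expose get/put/delete/scan and atomic batches."""

    def __init__(self, backend):
        self.backend = backend

    def get_transaction(self):
        return Transaction()

    def submit_transaction(self, txn):
        with self.backend.batch() as batch:          # assumed: applied atomically
            for op, key, value in txn.ops:
                if op == "set":
                    batch.put(key, value)
                else:
                    batch.delete(key)

    def get(self, prefix, keys):
        return {k: self.backend.get(prefix + "/" + k) for k in keys}

    def get_iterator(self, prefix):
        # assumed: scan() yields (key, value) pairs in sorted key order;
        # '0' is the ASCII character after '/', so this bounds the prefix range
        return self.backend.scan(start=prefix + "/", end=prefix + "0")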

sage

> 
> Regards
> Somnath
> 
> 
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com] 
> Sent: Thursday, September 11, 2014 6:55 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-users@lists.ceph.com; 
> ceph-de...@vger.kernel.org
> Subject: RE: Regarding key/value interface
> 
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> > Make perfect sense Sage..
> > 
> > Regarding striping of filedata, You are saying KeyValue interface will do 
> > the following for me?
> > 
> > 1. Say in case of rbd image of order 4 MB, a write request coming to 
> > Key/Value interface, it will  chunk the object (say full 4MB) in smaller 
> > sizes (configurable ?) and stripe it as multiple key/value pair ?
> > 
> > 2. Also, while reading it will take care of accumulating and send it back.
> 
> Precisely.
> 
> A smarter thing we might want to make it do in the future would be to take a 
> 4 KB write create a new key that logically overwrites part of the larger, 
> say, 1MB key, and apply it on read.  And maybe give up and rewrite the entire 
> 1MB stripe after too many small overwrites have accumulated.  
> Something along those lines to reduce the cost of small IOs to large objects.
> 
> sage
> 
> 
> 
>  > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -Original Message-
> > From: Sage Weil [mailto:sw...@redhat.com]
> > Sent: Thursday, September 11, 2014 6:31 PM
> > To: Somnath Roy
> > Cc: Haomai Wang (haomaiw...@gmail.com); ceph-users@lists.ceph.com; 
> > ceph-de...@vger.kernel.org
> > Subject: Re: Regarding key/value interface
> > 
> > Hi Somnath,
> > 
> > On Fri, 12 Sep 2014, Somnath Roy wrote:
> > >
> > > Hi Sage/Haomai,
> > >
> > > If I have a key/value backend that supports transactions and range 
> > > queries (and I don't need any explicit caching etc.) and I want to 
> > > replace filestore (and leveldb omap) with it, which interface do you 
> > > recommend I derive from: ObjectStore directly, or KeyValueDB?
> > >
> > > I have already integrated this backend by deriving from ObjectStore 
> > > interfaces earlier (pre key/value-interface days) but not tested 
> > > thoroughly enough to see what functionality is broken (Basic 
> > > functionalities of RGW/RBD are working fine).
> > >
> > > Basically, I want to know what are the advantages (and 
> > > disadvantages) of deriving it from the new key/value interfaces ?
> > >
> > > Also, what state is it in ? Is it feature complete and supporting 
> > > all the ObjectStore interfaces like clone and all ?
> > 
> > Everything is supported, I think, except perhaps some IO hints that don't make 
> > sense in a k/v context.  The big things that you get by using KeyValueStore 
> > and plugging into the lower-level interface are:
> > 
> >  - striping of file data across keys
> >  - efficient clone
> >  - a zillion smaller methods that aren't conceptually difficult to 
> > implement but are tedious to do.
> > 
> > The other nice thing about reusing this code is that you can use a leveldb 
> > or rocksdb backend as a reference for testing or performance or whatever.
> > 
> > The main thing that will be a challenge going forward, I predict, is making 
> > storage of the object byte payload in key/value pairs efficient.  I think 
> > KeyValueStore is doing some simple striping, but it will suffer for small 
> > overwrites (like 512-byte or 4k writes from an RBD).  There are probably 
> > some pretty simple heuristics and tricks that can be done to mitigate the 
> > most common patterns, but there is no simple solution since the backends 
> > generally don't support partial value updates (I assume yours doesn't 
> > either?).  But, any work done here will benefit the other backends too so 
> > that would be a win..
> > 
> > sage
> > 
> > 
> > 

Re: [ceph-users] Regarding key/value interface

2014-09-11 Thread Somnath Roy
Hi Haomai,

> Makes perfect sense, Sage.
>
> Regarding striping of filedata, You are saying KeyValue interface will do the 
> following for me?
>
> 1. Say in case of rbd image of order 4 MB, a write request coming to 
> Key/Value interface, it will  chunk the object (say full 4MB) in smaller 
> sizes (configurable ?) and stripe it as multiple key/value pair ?


Yes, and the stripe size is configurable.

[Somnath] That's great, thanks

>
>
> 2. Also, while reading it will take care of accumulating and send it back.



Do you have any other idea? 

[Somnath] No, I was just asking

By the way, could you tell us more about your key/value interface? I'm doing some 
work on an NVMe interface with an Intel NVMe SSD.

[Somnath] It has the following interfaces.

1. Init & shutdown

2. A container concept

3. Read/write objects, delete objects, enumerate objects, multi put/get support

4. Transaction semantics

5. Range query support

6. Container-level snapshots

7. Statistics

Let me know if you need anything more specific.
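
Roughly, the shape of that interface is something like the sketch below
(illustrative only; the names are paraphrased from the list above and are not
our actual API):

from abc import ABC, abstractmethod

class KVBackend(ABC):
    # 1. init & shutdown
    @abstractmethod
    def init(self, config): ...
    @abstractmethod
    def shutdown(self): ...

    # 2. container concept
    @abstractmethod
    def create_container(self, name): ...

    # 3. object read/write/delete/enumerate, multi put/get
    @abstractmethod
    def put(self, container, key, value): ...
    @abstractmethod
    def get(self, container, key): ...
    @abstractmethod
    def delete(self, container, key): ...
    @abstractmethod
    def multi_put(self, container, items): ...       # mapping of key -> value
    @abstractmethod
    def enumerate(self, container, start_key=None): ...

    # 4. transaction semantics
    @abstractmethod
    def begin_transaction(self): ...
    @abstractmethod
    def commit(self, txn): ...

    # 5. range query support
    @abstractmethod
    def range_query(self, container, start_key, end_key): ...

    # 6. container-level snapshot
    @abstractmethod
    def snapshot(self, container, snap_name): ...

    # 7. statistics
    @abstractmethod
    def stats(self): ...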

Thanks & Regards
Somnath

>
>
>
> Thanks & Regards
> Somnath
>
>
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Thursday, September 11, 2014 6:31 PM
> To: Somnath Roy
> Cc: Haomai Wang (haomaiw...@gmail.com); ceph-users@lists.ceph.com; 
> ceph-de...@vger.kernel.org
> Subject: Re: Regarding key/value interface
>
> Hi Somnath,
>
> On Fri, 12 Sep 2014, Somnath Roy wrote:
> >
> > Hi Sage/Haomai,
> >
> > If I have a key/value backend that supports transactions and range 
> > queries (and I don't need any explicit caching etc.) and I want to 
> > replace filestore (and leveldb omap) with it, which interface do you 
> > recommend I derive from: ObjectStore directly, or KeyValueDB?
> >
> > I have already integrated this backend by deriving from ObjectStore 
> > interfaces earlier (pre key/value-interface days) but not tested 
> > thoroughly enough to see what functionality is broken (Basic 
> > functionalities of RGW/RBD are working fine).
> >
> > Basically, I want to know what are the advantages (and 
> > disadvantages) of deriving it from the new key/value interfaces ?
> >
> > Also, what state is it in ? Is it feature complete and supporting 
> > all the ObjectStore interfaces like clone and all ?
>
> Everything is supported, I think, except perhaps some IO hints that don't make 
> sense in a k/v context.  The big things that you get by using KeyValueStore 
> and plugging into the lower-level interface are:
>
>  - striping of file data across keys
>  - efficient clone
>  - a zillion smaller methods that aren't conceptually difficult to implement 
> but are tedious to do.
>
> The other nice thing about reusing this code is that you can use a leveldb or 
> rocksdb backend as a reference for testing or performance or whatever.
>
> The main thing that will be a challenge going forward, I predict, is making 
> storage of the object byte payload in key/value pairs efficient.  I think 
> KeyValueStore is doing some simple striping, but it will suffer for small 
> overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
> pretty simple heuristics and tricks that can be done to mitigate the most 
> common patterns, but there is no simple solution since the backends generally 
> don't support partial value updates (I assume yours doesn't either?).  But, 
> any work done here will benefit the other backends too so that would be a 
> win..
>
> sage
>
> 
>
>



-- 

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph object back up details

2014-09-11 Thread M Ranga Swami Reddy
Thank you.
On Sep 8, 2014 9:21 PM, "Yehuda Sadeh"  wrote:

> Not sure I understand what you ask. Multiple zones within the same
> region configuration is described here:
>
>
> http://ceph.com/docs/master/radosgw/federated-config/#multi-site-data-replication
>
> Yehuda
>
> On Sun, Sep 7, 2014 at 10:32 PM, M Ranga Swami Reddy
>  wrote:
> > Hi Yahuda,
> > I need more info on Ceph object backup mechanism.. Could  please share
> > a related doc or link for this?
> > Thanks
> > Swami
> >
> > On Thu, Sep 4, 2014 at 10:58 PM, M Ranga Swami Reddy
> >  wrote:
> >> Hi,
> >> I need more info on Ceph object backup mechanism.. Could someone share a
> >> related doc or link for this?
> >> Thanks
> >> Swami
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-11 Thread Alexandre DERUMIER
>>For crucial, I'll try to apply the patch from stefan priebe, to ignore 
>>flushes (as crucial m550 have supercaps) 
>>http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html 
Here are the results with cache flush disabled:

crucial m550

#fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1
bw=177575KB/s, iops=44393 


- Original Message -

From: "Alexandre DERUMIER"
To: "Cedric Lemarchand"
Cc: ceph-users@lists.ceph.com
Sent: Friday, 12 September 2014 04:55:21
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K
IOPS

Hi, 
it seems that the Intel S3500 performs a lot better with O_DSYNC.

crucial m550 
 
#fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1 
bw=1249.9KB/s, iops=312 

intel s3500 
--- 
fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1 
#bw=41794KB/s, iops=10448 

ok, so 30x faster. 



For the Crucial, I have tried to apply the patch from Stefan Priebe to ignore
flushes (as the Crucial M550 has supercaps):
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
Coming from ZFS, this sounds like "zfs_nocacheflush".

Now results:

crucial m550 
 
#fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1 
bw=177575KB/s, iops=44393  



fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
result):
---
bw=12327KB/s, iops=3081

So not much better than before, but this time iostat shows only 15% utilization, and
latencies are lower.

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb   0,0029,000,00 3075,00 0,00 36748,5023,90 
0,290,100,000,10   0,05  15,20


So, the write bottleneck seems to be in Ceph.



I will send the S3500 results today.

- Original Message -

From: "Cedric Lemarchand"
To: ceph-users@lists.ceph.com
Sent: Thursday, 11 September 2014 21:23:23
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K
IOPS


On 11/09/2014 19:33, Cedric Lemarchand wrote:
> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>> Hi Sebastien, 
>> 
>> here my first results with crucial m550 (I'll send result with intel s3500 
>> later): 
>> 
>> - 3 nodes 
>> - dell r620 without expander backplane 
>> - sas controller : lsi LSI 9207 (no hardware raid or cache) 
>> - 2 x E5-2603v2 1.8GHz (4cores) 
>> - 32GB ram 
>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication. 
>> 
>> -os : debian wheezy, with kernel 3.10 
>> 
>> os + ceph mon : 2x intel s3500 100gb linux soft raid 
>> osd : crucial m550 (1TB). 
>> 
>> 
>> 3mon in the ceph cluster, 
>> and 1 osd (journal and datas on same disk) 
>> 
>> 
>> ceph.conf 
>> - 
>> debug_lockdep = 0/0 
>> debug_context = 0/0 
>> debug_crush = 0/0 
>> debug_buffer = 0/0 
>> debug_timer = 0/0 
>> debug_filer = 0/0 
>> debug_objecter = 0/0 
>> debug_rados = 0/0 
>> debug_rbd = 0/0 
>> debug_journaler = 0/0 
>> debug_objectcatcher = 0/0 
>> debug_client = 0/0 
>> debug_osd = 0/0 
>> debug_optracker = 0/0 
>> debug_objclass = 0/0 
>> debug_filestore = 0/0 
>> debug_journal = 0/0 
>> debug_ms = 0/0 
>> debug_monc = 0/0 
>> debug_tp = 0/0 
>> debug_auth = 0/0 
>> debug_finisher = 0/0 
>> debug_heartbeatmap = 0/0 
>> debug_perfcounter = 0/0 
>> debug_asok = 0/0 
>> debug_throttle = 0/0 
>> debug_mon = 0/0 
>> debug_paxos = 0/0 
>> debug_rgw = 0/0 
>> osd_op_threads = 5 
>> filestore_op_threads = 4 
>> 
>> ms_nocrc = true 
>> cephx sign messages = false 
>> cephx require signatures = false 
>> 
>> ms_dispatch_throttle_bytes = 0 
>> 
>> #0.85 
>> throttler_perf_counter = false 
>> filestore_fd_cache_size = 64 
>> filestore_fd_cache_shards = 32 
>> osd_op_num_threads_per_shard = 1 
>> osd_op_num_shards = 25 
>> osd_enable_op_tracker = true 
>> 
>> 
>> 
>> Fio disk 4K benchmark 
>> -- 
>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k 
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio 
>> bw=271755KB/s, iops=67938 
>> 
>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k 
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio 
>> bw=228293KB/s, iops=57073 
>> 
>> 
>> 
>> fio osd benchmark (through librbd) 
>> -- 
>> [global] 
>> ioengine=rbd 
>> clientname=admin 
>> pool=test 
>> rbdname=test 
>> invalidate=0 # mandatory 
>> rw=randwrite 
>> rw=randread 
>> bs=4k 
>> direct=1 
>> numjobs=4 
>> group_reporting=1 
>> 
>> [rbd_iodepth32] 
>> iodepth=32 
>> 
>> 
>> 
>> FIREFLY RESULTS 
>>  
>> fio randwrite : bw=5009.6KB/s, iops=1252 
>> 
>> fio randread: bw=37820KB

Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-11 Thread Alexandre DERUMIER
Results of fio on rbd with the kernel patch:



fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
result): 
--- 
bw=12327KB/s, iops=3081 

So not much better than before, but this time iostat shows only 15% utilization, and
latencies are lower.

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20 


So, the write bottleneck seems to be in Ceph.



I will send s3500 result today 

- Original Message -

From: "Alexandre DERUMIER"
To: "Cedric Lemarchand"
Cc: ceph-users@lists.ceph.com
Sent: Friday, 12 September 2014 07:58:05
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K
IOPS

>>For crucial, I'll try to apply the patch from stefan priebe, to ignore 
>>flushes (as crucial m550 have supercaps) 
>>http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html 
Here are the results with cache flush disabled:

crucial m550 
 
#fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1 
bw=177575KB/s, iops=44393 


- Original Message -

From: "Alexandre DERUMIER"
To: "Cedric Lemarchand"
Cc: ceph-users@lists.ceph.com
Sent: Friday, 12 September 2014 04:55:21
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K
IOPS

Hi, 
it seems that the Intel S3500 performs a lot better with O_DSYNC.

crucial m550 
 
#fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1 
bw=1249.9KB/s, iops=312 

intel s3500 
--- 
fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1 
#bw=41794KB/s, iops=10448 

ok, so 30x faster. 



For the Crucial, I have tried to apply the patch from Stefan Priebe to ignore
flushes (as the Crucial M550 has supercaps):
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
Coming from ZFS, this sounds like "zfs_nocacheflush".

Now results: 

crucial m550 
 
#fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
--group_reporting --invalidate=0 --name=ab --sync=1 
bw=177575KB/s, iops=44393 



fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
result): 
--- 
bw=12327KB/s, iops=3081 

So not much better than before, but this time iostat shows only 15% utilization, and
latencies are lower.

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
w_await svctm %util 
sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20 


So, the write bottleneck seems to be in Ceph.



I will send s3500 result today 

- Original Message -

From: "Cedric Lemarchand"
To: ceph-users@lists.ceph.com
Sent: Thursday, 11 September 2014 21:23:23
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K
IOPS


On 11/09/2014 19:33, Cedric Lemarchand wrote:
> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>> Hi Sebastien, 
>> 
>> here my first results with crucial m550 (I'll send result with intel s3500 
>> later): 
>> 
>> - 3 nodes 
>> - dell r620 without expander backplane 
>> - sas controller : lsi LSI 9207 (no hardware raid or cache) 
>> - 2 x E5-2603v2 1.8GHz (4cores) 
>> - 32GB ram 
>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication. 
>> 
>> -os : debian wheezy, with kernel 3.10 
>> 
>> os + ceph mon : 2x intel s3500 100gb linux soft raid 
>> osd : crucial m550 (1TB). 
>> 
>> 
>> 3mon in the ceph cluster, 
>> and 1 osd (journal and datas on same disk) 
>> 
>> 
>> ceph.conf 
>> - 
>> debug_lockdep = 0/0 
>> debug_context = 0/0 
>> debug_crush = 0/0 
>> debug_buffer = 0/0 
>> debug_timer = 0/0 
>> debug_filer = 0/0 
>> debug_objecter = 0/0 
>> debug_rados = 0/0 
>> debug_rbd = 0/0 
>> debug_journaler = 0/0 
>> debug_objectcatcher = 0/0 
>> debug_client = 0/0 
>> debug_osd = 0/0 
>> debug_optracker = 0/0 
>> debug_objclass = 0/0 
>> debug_filestore = 0/0 
>> debug_journal = 0/0 
>> debug_ms = 0/0 
>> debug_monc = 0/0 
>> debug_tp = 0/0 
>> debug_auth = 0/0 
>> debug_finisher = 0/0 
>> debug_heartbeatmap = 0/0 
>> debug_perfcounter = 0/0 
>> debug_asok = 0/0 
>> debug_throttle = 0/0 
>> debug_mon = 0/0 
>> debug_paxos = 0/0 
>> debug_rgw = 0/0 
>> osd_op_threads = 5 
>> filestore_op_threads = 4 
>> 
>> ms_nocrc = true 
>> cephx sign messages = false 
>> cephx require signatures = false 
>> 
>> ms_dispatch_throttle_bytes = 0 
>> 
>> #0.85 
>> throttler_perf_counter = false 
>> filestore_fd_cache_size = 64 
>> filestore_fd_cache_shards = 32 
>> osd_op_num_threads_per_shard = 1 
>> osd_op_num_shards = 25 
>> osd_enable_op_tracker = true 
>> 
>> 
>> 
>> Fio disk 4K benchmark 
>> -- 
>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k 
>> --iodepth=32 --group_repor

[ceph-users] help: a newbie question

2014-09-11 Thread brandon li
Hi,

I am new to the Ceph file system and have a newbie question:

For a sparse file, how does the Ceph file system know whether a hole in the file
was never written or whether a stripe was simply lost?

Thanks,
Brandon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com