Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Karan Singh
Hi

What type of clients do you have?

- Are they physical Linux hosts or VMs mounting Ceph RBD or CephFS?
- Or are they simply OpenStack / cloud instances using Ceph as Cinder volumes
or something like that?


- Karan -

 On 28 Jul 2015, at 11:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote:
 
 We've built a Ceph cluster:
 3 mon nodes (one of them combined with the mds)
 3 OSD nodes (each one has 10 OSDs + 2 SSDs for journaling)
 switch: 24 ports x 10G
 10 gigabit - for the public network
 20 gigabit bonding - between OSDs
 Ubuntu 12.04.05
 Ceph 0.87.2
 -
 Clients have:
 10 gigabit for the ceph connection
 CentOS 6.6 with kernel 3.19.8, using the CephFS kernel module
 
 
 
 == fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===
 Single client:
 
 
 Starting 16 processes
 
 .below is just 1 job info
 trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 
 13:26:24 2015
   write: io=10240MB, bw=78656KB/s, iops=76 , runt=133312msec
 slat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
 clat (usec): min=1 , max=68 , avg= 3.61, stdev= 1.99
  lat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
 clat percentiles (usec):
  |  1.00th=[1],  5.00th=[2], 10.00th=[2], 20.00th=[2],
  | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
  | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
  | 99.00th=[9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
  | 99.99th=[   62]
 bw (KB/s)  : min=35790, max=318215, per=6.31%, avg=78816.91, 
 stdev=26397.76
 lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
 lat (usec) : 100=0.03%
   cpu  : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
 
 ...what's above repeats 16 times... 
 
 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, 
 mint=133312msec, maxt=134329msec
 
 +
 Two clients:
 +
 below is just 1 job info
 trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 
 14:05:59 2015
   write: io=10240MB, bw=43154KB/s, iops=42 , runt=242984msec
 slat (usec): min=991 , max=285653 , avg=23716.12, stdev=23960.60
 clat (usec): min=1 , max=65 , avg= 3.67, stdev= 2.02
  lat (usec): min=994 , max=285664 , avg=23723.39, stdev=23962.22
 clat percentiles (usec):
  |  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
  | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
  | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
  | 99.00th=[8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
  | 99.99th=[   56]
 bw (KB/s)  : min=20630, max=276480, per=6.30%, avg=43328.34, 
 stdev=21905.92
 lat (usec) : 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
 lat (usec) : 100=0.03%
   cpu  : usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
 
 ...what's above repeats 16 times... 
 
 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, 
 mint=242331msec, maxt=243869msec
 
 - And almost the same(?!) aggregated result from the second client: 
 -
 
 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, 
 mint=244697msec, maxt=246941msec
 
 - To summarize: -
 aggrb1 + aggrb2 = 687960KB/s + 679401KB/s = 1367MB/s

 It looks like the bandwidth of just one client (aggrb=1219.8MB/s) was simply divided
 between the two. Why?
 Question: if I connect 12 client nodes, will each one only be able to write at ~100MB/s?
 Perhaps I need to scale our Ceph out to 15 (how many?) OSD nodes so that it can serve
 2 clients at 1.3GB/s each (the bandwidth of a 10GbE NIC), or not?
 
 
 
 health HEALTH_OK
  monmap e1: 3 mons at 
 {mon1=192.168.56.251:6789/0,mon2=192.168.56.252:6789/0,mon3=192.168.56.253:6789/0},
  election epoch 140, quorum 0,1,2 mon1,mon2,mon3
  mdsmap e12: 1/1/1 up {0=mon3=up:active}
  osdmap e832: 31 osds: 30 up, 30 in
   pgmap v106186: 6144 pgs, 3 pools, 2306 GB 

[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
We've built a Ceph cluster:
3 mon nodes (one of them combined with the mds)
3 OSD nodes (each one has 10 OSDs + 2 SSDs for journaling)
switch: 24 ports x 10G
10 gigabit - for the public network
20 gigabit bonding - between OSDs
Ubuntu 12.04.05
Ceph 0.87.2
-
Clients have:
10 gigabit for the ceph connection
CentOS 6.6 with kernel 3.19.8, using the CephFS kernel module



== fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===
Single client:


Starting 16 processes

.below is just 1 job info
trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 
13:26:24 2015
  write: io=10240MB, bw=78656KB/s, iops=76 , runt=133312msec
slat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
clat (usec): min=1 , max=68 , avg= 3.61, stdev= 1.99
 lat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
clat percentiles (usec):
 |  1.00th=[1],  5.00th=[2], 10.00th=[2], 20.00th=[2],
 | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
 | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
 | 99.00th=[9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
 | 99.99th=[   62]
bw (KB/s)  : min=35790, max=318215, per=6.31%, avg=78816.91, stdev=26397.76
lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
lat (usec) : 100=0.03%
  cpu  : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

...what's above repeats 16 times...

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, 
mint=133312msec, maxt=134329msec

+
Two clients:
+
below is just 1 job info
trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 
14:05:59 2015
  write: io=10240MB, bw=43154KB/s, iops=42 , runt=242984msec
slat (usec): min=991 , max=285653 , avg=23716.12, stdev=23960.60
clat (usec): min=1 , max=65 , avg= 3.67, stdev= 2.02
 lat (usec): min=994 , max=285664 , avg=23723.39, stdev=23962.22
clat percentiles (usec):
 |  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
 | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
 | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
 | 99.00th=[8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
 | 99.99th=[   56]
bw (KB/s)  : min=20630, max=276480, per=6.30%, avg=43328.34, stdev=21905.92
lat (usec) : 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
lat (usec) : 100=0.03%
  cpu  : usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

...what's above repeats 16 times...

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, 
mint=242331msec, maxt=243869msec

- And almost the same(?!) aggregated result from the second client: 
-

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, 
mint=244697msec, maxt=246941msec

- To summarize: -
aggrb1 + aggrb2 = 687960KB/s + 679401KB/s = 1367MB/s

It looks like the bandwidth of just one client (aggrb=1219.8MB/s) was simply divided
between the two. Why?
Question: if I connect 12 client nodes, will each one only be able to write at ~100MB/s?
Perhaps I need to scale our Ceph out to 15 (how many?) OSD nodes so that it can serve
2 clients at 1.3GB/s each (the bandwidth of a 10GbE NIC), or not?



health HEALTH_OK
 monmap e1: 3 mons at 
{mon1=192.168.56.251:6789/0,mon2=192.168.56.252:6789/0,mon3=192.168.56.253:6789/0},
 election epoch 140, quorum 0,1,2 mon1,mon2,mon3
 mdsmap e12: 1/1/1 up {0=mon3=up:active}
 osdmap e832: 31 osds: 30 up, 30 in
  pgmap v106186: 6144 pgs, 3 pools, 2306 GB data, 1379 kobjects
4624 GB used, 104 TB / 109 TB avail
6144 active+clean


Perhaps I don't understand something in the Ceph architecture? I thought that:

Each spindle can write ~100MB/s, and we have 10 SAS disks on each node, so the
aggregated write speed is ~900MB/s (because of striping etc.).
And we have 3 OSD nodes, and objects are also striped across all 30 OSDs - I thought 
it's also 
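
(For reference, a fio job file along these lines reproduces the parameters quoted
above - the directory and job name here are placeholders, not the exact ones used:)

[global]
; iodepth=1 in the reports above points at a simple synchronous engine
ioengine=sync
rw=write
bs=1M
size=10G
numjobs=16
; placeholder: wherever CephFS is mounted on the client
directory=/mnt/cephfs

[seqwrite-10g]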

Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-07-28 Thread Burkhard Linke

Hi,

On 07/28/2015 11:08 AM, Haomai Wang wrote:

On Tue, Jul 28, 2015 at 4:47 PM, Gregory Farnum g...@gregs42.com wrote:

On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke
burkhard.li...@computational.bio.uni-giessen.de wrote:


*snipsnap*
Can you give some details on those issues? I'm currently looking for 
a way to provide NFS-based access to CephFS for our desktop machines. 

Ummm...sadly I can't; we don't appear to have any tracker tickets and
I'm not sure where the report went to. :( I think it was from
Haomai...

My fault, I should have reported this in a ticket.

I have forgotten the details of the problem; I only posted the info on IRC :-(

It was related to the ls output: it would print the wrong user/group
owner as -1, maybe related to root squash?
Are you sure this problem is related to the CephFS FSAL? I also had a 
hard time setting up ganesha correctly, especially with respect to user 
and group mappings, particularly with a kerberized setup.


I'm currently running a small test setup with one server and one client 
to single out the last kerberos related problems (nfs-ganesha 2.2.0 / 
Ceph Hammer 0.94.2 / Ubuntu 14.04). User/group listings have been OK so 
far. Do you remember whether the problem occurs every time or just 
arbitrarily?


Best regards,
Burkhard
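
P.S. For anyone reproducing such a setup, a minimal CEPH FSAL export block for
nfs-ganesha 2.2 looks roughly like this (untested sketch; the export id and
pseudo path are arbitrary):

EXPORT {
    Export_Id = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    # worth toggling while chasing the -1 owner issue
    Squash = No_Root_Squash;
    FSAL {
        Name = CEPH;
    }
}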
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Hi,

But my question is why speed is divided between clients?
And how much OSDnodes, OSDdaemos, PGs, I have to add/remove to ceph,
that each cephfs-client could write with his max network speed (10Gbit/s ~ 
1.2GB/s)???



From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:46 PM
To: Shneur Zalman Mattern
Subject: Re: [ceph-users] Did maximum performance reached?

Hi,

size=3 would decrease your performance. But with size=2 your results are not 
bad either.
The math:
size=2 means each write is written 4 times (2 copies, each first to the journal and 
later to disk). Calculating with 1,300MB/s of client bandwidth, that means:

2 (size) * 1300 MB/s / 6 (SSDs) = 433MB/s per SSD
2 (size) * 1300 MB/s / 30 (HDDs) = 87MB/s per HDD


greetings

Johannes
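
If I try a rough back-of-envelope with the same write-amplification logic, assuming
~100 MB/s sustained per SAS HDD and ~400 MB/s per journal SSD (both figures are
guesses, not measurements):

  2 clients * 1200 MB/s                      = 2400 MB/s of client writes
  data traffic to HDDs:    2400 MB/s * 2     = 4800 MB/s  ->  ~48 HDDs at 100 MB/s
  journal traffic to SSDs: 2400 MB/s * 2     = 4800 MB/s  ->  ~12 SSDs at 400 MB/s

With 10 HDDs + 2 SSDs per node that is roughly 5-6 OSD nodes of the current shape
(the journal SSDs become the bottleneck first), ignoring CPU and network limits -
is that about right?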

 On 28.07.2015 at 11:41, Shneur Zalman Mattern shz...@eimsys.co.il wrote:

 Hi, Johannes (that's my grandpa's name)

 The size is 2, do you really think that number of replicas can increase 
 performance?
 on the  http://ceph.com/docs/master/architecture/
 written Note: Striping is independent of object replicas. Since CRUSH 
 replicates objects across OSDs, stripes get replicated automatically. 

 OK, I'll check it,
 Regards, Shneur
 
 From: Johannes Formann mlm...@formann.de
 Sent: Tuesday, July 28, 2015 12:09 PM
 To: Shneur Zalman Mattern
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Did maximum performance reached?

 Hello,

 what is the „size“ parameter of your pool?

 Some math do show the impact:
 size=3 means each write is written 6 times (3 copies, first journal, later 
 disk). Calculating with 1.300MB/s „Client“ Bandwidth that means:

 3 (size) * 1300 MB/s / 6 (SSD) = 650MB/s per SSD
 3 (size) * 1300 MB/s / 30 (HDD) = 130MB/s per HDD

 If you use size=3, the results are as good as one can expect. (Even with 
 size=2 the results won’t be bad)

 greetings

 Johannes

 On 28.07.2015 at 10:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote:

 We've built Ceph cluster:
3 mon nodes (one of them is combined with mds)
3 osd nodes (each one have 10 osd + 2 ssd for journaling)
switch 24 ports x 10G
10 gigabit - for public network
20 gigabit bonding - between osds
Ubuntu 12.04.05
Ceph 0.87.2
 -
 Clients has:
10 gigabit for ceph-connection
CentOS 6.6 with kernel 3.19.8 equipped by cephfs-kmodule



 == fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===
 Single client:
 

 Starting 16 processes

 .below is just 1 job info
 trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 
 13:26:24 2015
  write: io=10240MB, bw=78656KB/s, iops=76 , runt=133312msec
slat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
clat (usec): min=1 , max=68 , avg= 3.61, stdev= 1.99
 lat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
clat percentiles (usec):
 |  1.00th=[1],  5.00th=[2], 10.00th=[2], 20.00th=[2],
 | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
 | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
 | 99.00th=[9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
 | 99.99th=[   62]
bw (KB/s)  : min=35790, max=318215, per=6.31%, avg=78816.91, 
 stdev=26397.76
lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
lat (usec) : 100=0.03%
  cpu  : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, =64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
 issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

 ...what's above repeats 16 times...

 Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, 
 mint=133312msec, maxt=134329msec

 +
 Two clients:
 +
 below is just 1 job info
 trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 
 28 14:05:59 2015
  write: io=10240MB, bw=43154KB/s, iops=42 , runt=242984msec
slat (usec): min=991 , max=285653 , avg=23716.12, stdev=23960.60
clat (usec): min=1 , max=65 , avg= 3.67, stdev= 2.02
 lat (usec): min=994 , max=285664 , avg=23723.39, stdev=23962.22
clat percentiles (usec):
 |  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
 | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
 | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
 | 99.00th=[8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
 | 99.99th=[   56]
bw (KB/s)  : min=20630, max=276480, per=6.30%, avg=43328.34, 
 stdev=21905.92
lat (usec) : 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 

[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Hi!

And so, by your math,
I need to set size = number of OSDs, i.e. 30 replicas, for my 120TB cluster to get
the performance I need?
And end up with 4TB of real storage capacity at a price of $3000 per TB? Is this a joke?

All the best,
Shneur

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:46 PM
To: Shneur Zalman Mattern
Subject: Re: [ceph-users] Did maximum performance reached?

Hi,

size=3 would decrease your performance. But with size=2 your results are not 
bad either.
The math:
size=2 means each write is written 4 times (2 copies, each first to the journal and 
later to disk). Calculating with 1,300MB/s of client bandwidth, that means:

2 (size) * 1300 MB/s / 6 (SSDs) = 433MB/s per SSD
2 (size) * 1300 MB/s / 30 (HDDs) = 87MB/s per HDD


greetings

Johannes

 
 





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-07-28 Thread Burkhard Linke

Hi,

On 07/27/2015 05:42 PM, Gregory Farnum wrote:

On Mon, Jul 27, 2015 at 4:33 PM, Burkhard Linke
burkhard.li...@computational.bio.uni-giessen.de wrote:

Hi,

the nfs-ganesha documentation states:

... This FSAL links to a modified version of the CEPH library that has been
extended to expose its distributed cluster and replication facilities to the
pNFS operations in the FSAL. ... The CEPH library modifications have not
been merged into the upstream yet. 

(https://github.com/nfs-ganesha/nfs-ganesha/wiki/Fsalsupport#ceph)

Is this still the case with the hammer release?

The FSAL has been upstream for quite a while, but it's not part of our
regular testing yet and I'm not sure what it gets from the Ganesha
side. I'd encourage you to test it, but be wary — we had a recent
report of some issues we haven't been able to set up to reproduce yet.
Can you give some details on those issues? I'm currently looking for a 
way to provide NFS-based access to CephFS for our desktop machines.


The kernel NFS implementation in Ubuntu had some problems with CephFS in 
our setup, which I was not able to resolve yet. Ganesha seems to be more 
promising, since it uses libcephfs directly and does not need a 
mountpoint of its own.


Best regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-07-28 Thread Haomai Wang
On Tue, Jul 28, 2015 at 5:28 PM, Burkhard Linke
burkhard.li...@computational.bio.uni-giessen.de wrote:
 Hi,

 On 07/28/2015 11:08 AM, Haomai Wang wrote:

 On Tue, Jul 28, 2015 at 4:47 PM, Gregory Farnum g...@gregs42.com wrote:

 On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke
 burkhard.li...@computational.bio.uni-giessen.de wrote:


 *snipsnap*

 Can you give some details on that issues? I'm currently looking for a
 way to provide NFS based access to CephFS to our desktop machines.

 Ummm...sadly I can't; we don't appear to have any tracker tickets and
 I'm not sure where the report went to. :( I think it was from
 Haomai...

 My fault, I should report this to ticket.

 I have forgotten the details about the problem, I submit the infos to IRC
 :-(

 It related to the ls output. It will print the wrong user/group
 owner as -1, maybe related to root squash?

 Are you sure this problem is related to the CephFS FSAL? I also had a hard
 time setting up ganesha correctly, especially with respect to user and group
 mappings, especially with a kerberized setup.

 I'm currently running a small test setup with one server and one client to
 single out the last kerberos related problems (nfs-ganesha 2.2.0 / Ceph
 Hammer 0.94.2 / Ubuntu 14.04). User/group listings have been OK so far. Do
 you remember whether the problem occurs every time or just arbitrarily?


Great!

I'm not sure of the reason. I guess it may be related to the nfs-ganesha version
or the client distro version.

 Best regards,
 Burkhard
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Hi, Karan!

Those are physical CentOS clients with CephFS mounted via the kernel module (kernel 4.1.3).

Thanks

Hi

What type of clients do you have?

- Are they physical Linux hosts or VMs mounting Ceph RBD or CephFS?
- Or are they simply OpenStack / cloud instances using Ceph as Cinder volumes
or something like that?


- Karan -

 On 28 Jul 2015, at 11:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote:

 We've built Ceph cluster:
3 mon nodes (one of them is combined with mds)
 3 osd nodes (each one have 10 osd + 2 ssd for journaling)
 switch 24 ports x 10G
 10 gigabit - for public network
 20 gigabit bonding - between osds
 Ubuntu 12.04.05
 Ceph 0.87.2
 -
 Clients has:
 10 gigabit for ceph-connection
 CentOS 6.6 with kernel 4.1.3 equipped by cephfs-kmodule



 
 





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Oh, now I have to cry :-)
not because they're not SSDs... they're SAS2 HDDs.

Because I need to build something for 140 clients... 4200 OSDs.

:-(

It looks like I could pick up performance with SSDs, but I need a huge capacity, ~2PB.
Perhaps a cache tiering pool could save my money, but I've read here that it's
slower than most people think...

:-(

Why is Lustre more performant with the same HDDs?

 
 





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which kernel version can help avoid kernel client deadlock

2015-07-28 Thread van
Hi, Ilya,

  Thanks for your quick reply.

  Here is the link: http://ceph.com/docs/cuttlefish/faq/ , under the “HOW CAN I GIVE CEPH A 
TRY?” section, which talks about the old-kernel issue.

  By the way, what is the main reason for using kernel 4.1 - are there a lot of 
critical bug fixes in that version besides the perf improvements?
  I am worried that kernel 4.1 is so new that it may introduce other problems.
  And if I'm using the librbd API, does the kernel version matter?

  In my tests, I built a 2-node cluster, each node with only one OSD, with CentOS 
7.1, kernel version 3.10.0.229 and ceph v0.94.2.
  I created several rbds and ran mkfs.xfs on them to create filesystems 
(the kernel client was running on the ceph cluster nodes).
  I performed heavy IO tests on those filesystems and found that some fio processes got 
hung and stayed in D state forever (uninterruptible sleep).
  I suspect it's the deadlock that makes the fio processes hang.
  However, the ceph-osd daemons are still responsive, and I can operate rbd via the librbd 
API.
  Does this mean it's not the loopback mount deadlock that caused the fio 
processes to hang?
  Or is it also a deadlock phenomenon where only one thread is blocked in memory 
allocation while other threads can still receive API requests, so the 
ceph-osds remain responsive?
 
  What is worth mentioning is that after I restart the ceph-osd daemon, all 
processes in D state come back to a normal state.

  Below is related log in kernel:

Jul  7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more 
than 120 seconds.
Jul  7 02:25:39 node0 kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
Jul  7 02:25:39 node0 kernel: xfsaild/rbd1D 880c2fc13680 0 24795
  2 0x0080
Jul  7 02:25:39 node0 kernel: 8801d6343d40 0046 
8801d6343fd8 00013680
Jul  7 02:25:39 node0 kernel: 8801d6343fd8 00013680 
880c0c0b 880c0c0b
Jul  7 02:25:39 node0 kernel: 880c2fc14340 0001 
 8805bace2528
Jul  7 02:25:39 node0 kernel: Call Trace:
Jul  7 02:25:39 node0 kernel: [81609e39] schedule+0x29/0x70
Jul  7 02:25:39 node0 kernel: [a03a1890] _xfs_log_force+0x230/0x290 
[xfs]
Jul  7 02:25:39 node0 kernel: [810a9620] ? wake_up_state+0x20/0x20
Jul  7 02:25:39 node0 kernel: [a03a1916] xfs_log_force+0x26/0x80 [xfs]
Jul  7 02:25:39 node0 kernel: [a03a6390] ? 
xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jul  7 02:25:39 node0 kernel: [a03a64e1] xfsaild+0x151/0x5e0 [xfs]
Jul  7 02:25:39 node0 kernel: [a03a6390] ? 
xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jul  7 02:25:39 node0 kernel: [8109739f] kthread+0xcf/0xe0
Jul  7 02:25:39 node0 kernel: [810972d0] ? 
kthread_create_on_node+0x140/0x140
Jul  7 02:25:39 node0 kernel: [8161497c] ret_from_fork+0x7c/0xb0
Jul  7 02:25:39 node0 kernel: [810972d0] ? 
kthread_create_on_node+0x140/0x140
Jul  7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more 
than 120 seconds.

  Has anyone encountered the same problem, or could anyone help with this?

  Thanks. 
  
 
 On Jul 28, 2015, at 3:01 PM, Ilya Dryomov idryo...@gmail.com wrote:
 
 On Tue, Jul 28, 2015 at 9:17 AM, van chaofa...@owtware.com wrote:
 Hi, list,
 
  I found on the ceph FAQ that, ceph kernel client should not run on
 machines belong to ceph cluster.
  As ceph FAQ metioned, “In older kernels, Ceph can deadlock if you try to
 mount CephFS or RBD client services on the same host that runs your test
 Ceph cluster. This is not a Ceph-related issue.”
  Here it says that there will be deadlock if using old kernel version.
  I wonder if anyone knows which new kernel version solve this loopback
 mount deadlock.
  It will be a great help since I do need to use rbd kernel client on the
 ceph cluster.
 
 Note that doing this is *not* recommended.  That said, if you don't
 push your system to its knees too hard, it should work.  I'm not sure
 what exactly constitutes an older kernel as per that FAQ (as you
 haven't even linked it), but even if I knew, I'd still suggest 4.1.


 
 
  As I search more informations, I found two articals
 https://lwn.net/Articles/595652/ and https://lwn.net/Articles/596618/  talk
 about supporting nfs loopback mount,it seems they do effort not on memory
 management only, but also on nfs related codes, I wonder if ceph has also so
 some effort on kernel client to solve this problem. If ceph did, could
 anyone help provide the kernel version with the patch?
 
 There wasn't any specific effort on the ceph side, but we do try not to
 break it: sometime around 3.18 a ceph patch was merged that made it
 impossible to co-locate the kernel client with OSDs; once we realized
 that, the culprit patch was reverted and the revert was backported.
 
 So the bottom line is we don't recommend it, but we try not to break
 your ability to do it ;)
 
 Thanks,
 
Ilya
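
For clarity, the two access paths I am comparing look like this (image names are
examples; rbd bench-write options and qemu's rbd support depend on the build):

# kernel client path (krbd) - this is where the kernel version and the
# co-location caveat matter:
rbd create test --size 10240
rbd map test                  # exposes /dev/rbd0 via the kernel module
mkfs.xfs /dev/rbd0

# userspace path (librbd/librados) - no kernel client involved:
rbd bench-write test --io-size 4194304 --io-threads 16 --io-total 1073741824
qemu-img info rbd:rbd/test    # qemu talks to the cluster through librbd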


[ceph-users] wrong documentation in add or rm mons

2015-07-28 Thread Makkelie, R (ITCDCC) - KLM
I followed the documentation below to add monitors to my already existing
cluster with 1 mon:
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

When I follow this documentation,
the new monitor assimilates the old monitor and my monitor status is gone.

But when I skip the "ceph mon add <mon-id> <ip>[:<port>]" part,
it adds the monitor and everything works well.

This issue also happens with "ceph-deploy mon add".

So I think the documentation is not correct -
can someone confirm this?

greetz
Ramonskie
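
Or is the explanation simply that going from 1 monitor to 2 means quorum suddenly
requires both, so if "ceph mon add" registers the new mon in the monmap before the
new daemon is actually running, the cluster appears hung until it joins?
For reference, the documented manual sequence is roughly the following (ids,
addresses and the init command are examples):

ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i mon2 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
ceph mon add mon2 192.168.0.2:6789
service ceph start mon.mon2    # start the new daemon right away so quorum recovers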


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of cephfs with samba

2015-07-28 Thread Gregory Farnum
On Mon, Jul 27, 2015 at 6:25 PM, Jörg Henne henn...@gmail.com wrote:
 Gregory Farnum greg@... writes:

 Yeah, I think there were some directory listing bugs in that version
 that Samba is probably running into. They're fixed in a newer kernel
 release (I'm not sure which one exactly, sorry).

 Ok, thanks, good to know!

  and then detaches itself but the mountpoint stays empty no matter what.
  /var/log/ceph/ceph-client.admin.log isn't enlightening either. I've never
  used a FUSE before, though, so I might be overlooking something.

 Uh, that's odd. What do you mean it's empty no matter what? Is the
 ceph-fuse process actually still running?

 Yes, e.g.

  8525 pts/0Sl 0:00 ceph-fuse -m 10.208.66.1:6789 /mnt/regtest2

 But

 root@gru:/mnt# ls /mnt/regtest2 | wc -l
 0

 With the kernel module I mount just a subpath of the cephfs space like in

 /etc/fstab:
 my_monhost:/regression-test /mnt/regtest ...

 which ceph-fuse doesn't seem to support, but then I would expect
 regression-test to simply be a sub-directory of /mnt/regtest2.

You can mount subtrees with the -r option to ceph-fuse.

Once you've started it up you should find a file like
client.admin.[0-9]*.asok in (I think?) /var/run/ceph. You can run
ceph --admin-daemon /var/run/ceph/{client_asok} status and provide
the output to see if it's doing anything useful. Or set debug client
= 20 in the config and then upload the client log file either
publicly or with ceph-post-file and I'll take a quick look to see
what's going on.
-Greg
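
Concretely, something like this (monitor address and paths from your mail; the
socket name pattern may differ on your box):

ceph-fuse -m 10.208.66.1:6789 -r /regression-test /mnt/regtest2
ceph --admin-daemon /var/run/ceph/*client.admin.*.asok status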


 (You should also be able to talk to Ceph directly via the Samba
 daemon; the bindings are in upstream Samba although you probably need
 to install one of the Ceph packages to make it work. That's the way we
 test in our nightlies.)

 Indeed, it seems like something is missing:

 [2015/07/27 19:21:40.080572,  0] ../lib/util/modules.c:48(load_module)
   Error loading module '/usr/lib/x86_64-linux-gnu/samba/vfs/ceph.so':
 /usr/lib/x86_64-linux-gnu/samba/vfs/ceph.so: cannot open shared object file:
 No such file or directory

Mmm, that looks like a Samba config issue which unfortunately I don't
know much about. Perhaps you need to install these modules
individually? It looks like our nightly tests are just getting the
Ceph VFS installed by default. :/
-Greg
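
That said, once the module file is actually present (on Ubuntu I believe it comes
from a separate samba-vfs-modules package, but check your distro), a share that
goes through libcephfs directly should look roughly like this - parameter names
are from the vfs_ceph manpage, the share path and cephx id are examples:

[cephfs]
   ; path inside CephFS, not a local mountpoint
   path = /regression-test
   vfs objects = ceph
   ceph:config_file = /etc/ceph/ceph.conf
   ; cephx id whose keyring smbd can read
   ceph:user_id = admin
   read only = no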
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hadoop on ceph

2015-07-28 Thread Gregory Farnum
On Mon, Jul 27, 2015 at 6:34 PM, Patrick McGarry pmcga...@redhat.com wrote:
 Moving this to the ceph-user list where it has a better chance of
 being answered.



 On Mon, Jul 27, 2015 at 5:35 AM, jingxia@baifendian.com
 jingxia@baifendian.com wrote:
 Dear list,
 I have some questions to ask.
 The docs describe Hadoop on Ceph but require the Hadoop 1.1.x stable series.
 I want to know whether the CephFS Hadoop plugin can be used with Hadoop 2.6.0 now,
 or whether it does not support Hadoop 2.6.0 and is still being developed.
 If Ceph cannot be used with Hadoop 2.6.0, I would like to know when it will be
 usable and whether there is a team developing it.
 Hadoop 1.1.2 on Ceph works fine for me, but when Hadoop 2.6.0 uses Ceph,
 something goes wrong and HDFS is still used.

The current Hadoop plugin we test with should run against Hadoop 2.
There are a couple of different versions floating around so maybe you
managed to grab the old one?
But in any case the Ceph plugin has very little to do with whether
HDFS gets started or not; that's all in your configuration steps and
scripts.

Development on the Hadoop integration is pretty sporadic but it runs
in our nightlies so we notice if it breaks.
-Greg
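
Roughly, the core-site.xml wiring should look like the sketch below - property names
are from the cephfs-hadoop documentation as far as I recall, so double-check them
against the plugin version you grab; the monitor host is an example, and the
cephfs-hadoop jar plus the libcephfs Java bindings still have to be on the Hadoop
classpath:

<!-- sketch only -->
<property>
  <name>fs.defaultFS</name>
  <value>ceph://mon-host:6789/</value>
</property>
<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>
<property>
  <name>ceph.conf.file</name>
  <value>/etc/ceph/ceph.conf</value>
</property>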
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD RAM usage values

2015-07-28 Thread Kenneth Waegeman



On 07/17/2015 02:50 PM, Gregory Farnum wrote:

On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:

Hi all,

I've read in the documentation that OSDs use around 512MB on a healthy
cluster.(http://ceph.com/docs/master/start/hardware-recommendations/#ram)
Now, our OSDs are all using around 2GB of RAM while the cluster is
healthy.


   PID USER  PR  NI  VIRT  RES  SHR  S  %CPU %MEM  TIME+  COMMAND
29784 root  20   0 6081276 2.535g   4740 S   0.7  8.1   1346:55 ceph-osd
32818 root  20   0 5417212 2.164g  24780 S  16.2  6.9   1238:55 ceph-osd
25053 root  20   0 5386604 2.159g  27864 S   0.7  6.9   1192:08 ceph-osd
33875 root  20   0 5345288 2.092g   3544 S   0.7  6.7   1188:53 ceph-osd
30779 root  20   0 5474832 2.090g  28892 S   1.0  6.7   1142:29 ceph-osd
22068 root  20   0 5191516 2.000g  28664 S   0.7  6.4  31:56.72 ceph-osd
34932 root  20   0 5242656 1.994g   4536 S   0.3  6.4   1144:48 ceph-osd
26883 root  20   0 5178164 1.938g   6164 S   0.3  6.2   1173:01 ceph-osd
31796 root  20   0 5193308 1.916g  27000 S  16.2  6.1 923:14.87 ceph-osd
25958 root  20   0 5193436 1.901g   2900 S   0.7  6.1   1039:53 ceph-osd
27826 root  20   0 5225764 1.845g   5576 S   1.0  5.9   1031:15 ceph-osd
36011 root  20   0 5111660 1.823g  20512 S  15.9  5.8   1093:01 ceph-osd
19736 root  20   0 2134680 0.994g  0 S   0.3  3.2  46:13.47 ceph-osd



[root@osd003 ~]# ceph status
2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following dangerous
and experimental features are enabled: keyvaluestore
2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following dangerous
and experimental features are enabled: keyvaluestore
 cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
  health HEALTH_OK
  monmap e1: 3 mons at
{mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}
 election epoch 58, quorum 0,1,2 mds01,mds02,mds03
  mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby
  osdmap e25542: 258 osds: 258 up, 258 in
   pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects
 270 TB used, 549 TB / 819 TB avail
 4152 active+clean
8 active+clean+scrubbing+deep


We are using erasure coding on most of our OSDs, so maybe that is a reason.
But the cache-pool filestore OSDs on 200GB SSDs are also using 2GB of RAM.
Our erasure-coded pool (16*14 OSDs) has a pg_num of 2048; our cache pool
(2*14 OSDs) has a pg_num of 1024.

Are these normal values for this configuration, and is the documentation a
bit outdated, or should we look into something else?


2GB of RSS is larger than I would have expected, but not unreasonable.
In particular I don't think we've gathered numbers on either EC pools
or on the effects of the caching processes.


What data is actually in the memory of the OSDs?
Is it mostly cached data?
We are short on memory on these servers; can we influence this?

Thanks again!
Kenneth


-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Hi, Johannes (that's my grandpa's name)

The size is 2. Do you really think that the number of replicas can have such an 
impact on performance?
On http://ceph.com/docs/master/architecture/ it is written: "Note: Striping is 
independent of object replicas. Since CRUSH replicates objects across OSDs, 
stripes get replicated automatically."

OK, I'll check it,
Regards, Shneur

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:09 PM
To: Shneur Zalman Mattern
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Did maximum performance reached?

Hello,

what is the „size“ parameter of your pool?

Some math do show the impact:
size=3 means each write is written 6 times (3 copies, first journal, later 
disk). Calculating with 1.300MB/s „Client“ Bandwidth that means:

3 (size) * 1300 MB/s / 6 (SSD) = 650MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) = 130MB/s per HDD

If you use size=3, the results are as good as one can expect. (Even with size=2 
the results won’t be bad)

greetings

Johannes

 On 28.07.2015 at 10:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote:

 We've built Ceph cluster:
 3 mon nodes (one of them is combined with mds)
 3 osd nodes (each one have 10 osd + 2 ssd for journaling)
 switch 24 ports x 10G
 10 gigabit - for public network
 20 gigabit bonding - between osds
 Ubuntu 12.04.05
 Ceph 0.87.2
 -
 Clients has:
 10 gigabit for ceph-connection
 CentOS 6.6 with kernel 3.19.8 equipped by cephfs-kmodule



 == fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===
 Single client:
 

 Starting 16 processes

 .below is just 1 job info
 trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 
 13:26:24 2015
   write: io=10240MB, bw=78656KB/s, iops=76 , runt=133312msec
 slat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
 clat (usec): min=1 , max=68 , avg= 3.61, stdev= 1.99
  lat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
 clat percentiles (usec):
  |  1.00th=[1],  5.00th=[2], 10.00th=[2], 20.00th=[2],
  | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
  | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
  | 99.00th=[9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
  | 99.99th=[   62]
 bw (KB/s)  : min=35790, max=318215, per=6.31%, avg=78816.91, 
 stdev=26397.76
 lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
 lat (usec) : 100=0.03%
   cpu  : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, =64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

 ...what's above repeats 16 times...

 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, 
 mint=133312msec, maxt=134329msec

 +
 Two clients:
 +
 below is just 1 job info
 trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 
 14:05:59 2015
   write: io=10240MB, bw=43154KB/s, iops=42 , runt=242984msec
 slat (usec): min=991 , max=285653 , avg=23716.12, stdev=23960.60
 clat (usec): min=1 , max=65 , avg= 3.67, stdev= 2.02
  lat (usec): min=994 , max=285664 , avg=23723.39, stdev=23962.22
 clat percentiles (usec):
  |  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
  | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
  | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
  | 99.00th=[8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
  | 99.99th=[   56]
 bw (KB/s)  : min=20630, max=276480, per=6.30%, avg=43328.34, 
 stdev=21905.92
 lat (usec) : 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
 lat (usec) : 100=0.03%
   cpu  : usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, =64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

 ...what's above repeats 16 times...

 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, 
 mint=242331msec, maxt=243869msec

 - And almost the same(?!) aggregated result from the second client: 
 -

 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, 
 mint=244697msec, maxt=246941msec

 - If I'll 

Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Johannes Formann
Hello,

What is the "size" parameter of your pool?

Some math to show the impact:
size=3 means each write is written 6 times (3 copies, each first to the journal and 
later to disk). Calculating with 1,300MB/s of client bandwidth, that means:

3 (size) * 1300 MB/s / 6 (SSD) = 650MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) = 130MB/s per HDD

If you use size=3, the results are as good as one can expect. (Even with size=2 
the results won’t be bad)

greetings

Johannes

 On 28.07.2015 at 10:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote:
 
 We've built Ceph cluster:
 3 mon nodes (one of them is combined with mds)
 3 osd nodes (each one have 10 osd + 2 ssd for journaling)
 switch 24 ports x 10G
 10 gigabit - for public network
 20 gigabit bonding - between osds 
 Ubuntu 12.04.05
 Ceph 0.87.2
 -
 Clients has:
 10 gigabit for ceph-connection
 CentOS 6.6 with kernel 3.19.8 equipped by cephfs-kmodule 
 
 
 
 == fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===
 Single client:
 
 
 Starting 16 processes
 
 .below is just 1 job info
 trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 
 13:26:24 2015
   write: io=10240MB, bw=78656KB/s, iops=76 , runt=133312msec
 slat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
 clat (usec): min=1 , max=68 , avg= 3.61, stdev= 1.99
  lat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
 clat percentiles (usec):
  |  1.00th=[1],  5.00th=[2], 10.00th=[2], 20.00th=[2],
  | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
  | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
  | 99.00th=[9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
  | 99.99th=[   62]
 bw (KB/s)  : min=35790, max=318215, per=6.31%, avg=78816.91, 
 stdev=26397.76
 lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
 lat (usec) : 100=0.03%
   cpu  : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, =64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
 
 ...what's above repeats 16 times... 
 
 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, 
 mint=133312msec, maxt=134329msec
 
 +
 Two clients:
 +
 below is just 1 job info
 trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 
 14:05:59 2015
   write: io=10240MB, bw=43154KB/s, iops=42 , runt=242984msec
 slat (usec): min=991 , max=285653 , avg=23716.12, stdev=23960.60
 clat (usec): min=1 , max=65 , avg= 3.67, stdev= 2.02
  lat (usec): min=994 , max=285664 , avg=23723.39, stdev=23962.22
 clat percentiles (usec):
  |  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
  | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
  | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
  | 99.00th=[8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
  | 99.99th=[   56]
 bw (KB/s)  : min=20630, max=276480, per=6.30%, avg=43328.34, 
 stdev=21905.92
 lat (usec) : 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
 lat (usec) : 100=0.03%
   cpu  : usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, =64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 =64=0.0%
  issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
 
 ...what's above repeats 16 times... 
 
 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, 
 mint=242331msec, maxt=243869msec
 
 - And almost the same(?!) aggregated result from the second client: 
 -
 
 Run status group 0 (all jobs):
   WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, 
 mint=244697msec, maxt=246941msec
 
 - If I'll summarize: -
 aggrb1 + aggrb2 = 687960KB/s + 679401KB/s = 1367MB/s 
 
 it looks like the same bandwidth from just one client aggrb=1219.8MB/s and it 
 was divided? why?
 Question: If I'll connect 12 clients nodes - each one can write just on 
 100MB/s?
 Perhaps, I need to scale out our ceph up to 15(how many?) OSD nodes - and 
 it'll serve 2 clients on the 1.3GB/s (bw of 10gig nic), or not? 
 
 
 
 health HEALTH_OK
  monmap e1: 3 mons at 
 

Re: [ceph-users] OSD RAM usage values

2015-07-28 Thread Gregory Farnum
On Tue, Jul 28, 2015 at 11:00 AM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:


 On 07/17/2015 02:50 PM, Gregory Farnum wrote:

 On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman
 kenneth.waege...@ugent.be wrote:

 Hi all,

 I've read in the documentation that OSDs use around 512MB on a healthy
 cluster.(http://ceph.com/docs/master/start/hardware-recommendations/#ram)
 Now, our OSD's are all using around 2GB of RAM memory while the cluster
 is
 healthy.


    PID USER  PR  NI  VIRT  RES  SHR  S  %CPU %MEM  TIME+  COMMAND
 29784 root  20   0 6081276 2.535g   4740 S   0.7  8.1   1346:55 ceph-osd
 32818 root  20   0 5417212 2.164g  24780 S  16.2  6.9   1238:55 ceph-osd
 25053 root  20   0 5386604 2.159g  27864 S   0.7  6.9   1192:08 ceph-osd
 33875 root  20   0 5345288 2.092g   3544 S   0.7  6.7   1188:53 ceph-osd
 30779 root  20   0 5474832 2.090g  28892 S   1.0  6.7   1142:29 ceph-osd
 22068 root  20   0 5191516 2.000g  28664 S   0.7  6.4  31:56.72 ceph-osd
 34932 root  20   0 5242656 1.994g   4536 S   0.3  6.4   1144:48 ceph-osd
 26883 root  20   0 5178164 1.938g   6164 S   0.3  6.2   1173:01 ceph-osd
 31796 root  20   0 5193308 1.916g  27000 S  16.2  6.1 923:14.87 ceph-osd
 25958 root  20   0 5193436 1.901g   2900 S   0.7  6.1   1039:53 ceph-osd
 27826 root  20   0 5225764 1.845g   5576 S   1.0  5.9   1031:15 ceph-osd
 36011 root  20   0 5111660 1.823g  20512 S  15.9  5.8   1093:01 ceph-osd
 19736 root  20   0 2134680 0.994g  0 S   0.3  3.2  46:13.47 ceph-osd



 [root@osd003 ~]# ceph status
 2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following
 dangerous
 and experimental features are enabled: keyvaluestore
 2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following
 dangerous
 and experimental features are enabled: keyvaluestore
  cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
   health HEALTH_OK
   monmap e1: 3 mons at

 {mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}
  election epoch 58, quorum 0,1,2 mds01,mds02,mds03
   mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby
   osdmap e25542: 258 osds: 258 up, 258 in
pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects
  270 TB used, 549 TB / 819 TB avail
  4152 active+clean
 8 active+clean+scrubbing+deep


 We are using erasure code on most of our OSDs, so maybe that is a reason.
 But also the cache-pool filestore OSDS on 200GB SSDs are using 2GB of
 RAM.
 Our erasure code pool (16*14 osds) have a pg_num of 2048; our cache pool
 (2*14 OSDS) has a pg_num of 1024.

 Are these normal values for this configuration, and is the documentation
 a
 bit outdated, or should we look into something else?


 2GB of RSS is larger than I would have expected, but not unreasonable.
 In particular I don't think we've gathered numbers on either EC pools
 or on the effects of the caching processes.


 Which data is actually in memory of the OSDS?
 Is this mostly cached data?
 We are short on memory on these servers, can we have influence on this?

Mmm, we've discussed this a few times on the mailing list. The CERN
guys published a document on experimenting with a very large cluster
and not enough RAM, but there's nothing I would really recommend
changing for a production system, especially an EC one, if you aren't
intimately familiar with what's going on.
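One thing that can help narrow down where the memory sits is tcmalloc's heap
introspection via the admin interface (assuming the stock tcmalloc builds; osd.0
is just an example):

ceph tell osd.0 heap stats      # bytes in use vs. freed-but-not-returned to the OS
ceph tell osd.0 heap release    # ask tcmalloc to give freed pages back to the OS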
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuring MemStore in Ceph

2015-07-28 Thread Haomai Wang
On Wed, Jul 29, 2015 at 10:21 AM, Aakanksha Pudipeddi-SSI
aakanksha...@ssi.samsung.com wrote:
 Hello Haomai,

 I am using v0.94.2.

 Thanks,
 Aakanksha

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Tuesday, July 28, 2015 7:20 PM
 To: Aakanksha Pudipeddi-SSI
 Cc: ceph-us...@ceph.com
 Subject: Re: [ceph-users] Configuring MemStore in Ceph

 Which version do you use?

 https://github.com/ceph/ceph/commit/c60f88ba8a6624099f576eaa5f1225c2fcaab41a
 should fix your problem

 On Wed, Jul 29, 2015 at 5:44 AM, Aakanksha Pudipeddi-SSI 
 aakanksha...@ssi.samsung.com wrote:
 Hello,



 I am trying to setup a ceph cluster with a memstore backend. The
 problem is, it is always created with a fixed size (1GB). I made
 changes to the ceph.conf file as follows:



 osd_objectstore = memstore

 memstore_device_bytes = 5*1024*1024*1024



 The resultant cluster still has 1GB allocated to it. Could anybody
 point out what I am doing wrong here?

What do you mean by "The resultant cluster still has 1GB allocated to it"?

Do you mean that you can't write more than 1GB of data?
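
One guess, in case it is simply a parsing issue: as far as I know the config parser
only takes literal values and will not evaluate "5*1024*1024*1024", so spelling the
size out in bytes may already change the behaviour, e.g.:

[osd]
osd objectstore = memstore
# 5 GiB spelled out, since arithmetic is not evaluated
memstore device bytes = 5368709120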




 Thanks,

 Aakanksha


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating OSD Parameters

2015-07-28 Thread Nikhil Mitra (nikmitra)
I believe you can use "ceph tell" to inject it into a running cluster.

From your admin node you should be able to run
ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_max_backfills 1'


Regards,
Nikhil Mitra
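
You can confirm the change took effect on a daemon via its admin socket, and keeping
the values in ceph.conf is still needed so they survive a restart (osd.0 below is just
an example):

ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_max_active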


From: ceph-users 
ceph-users-boun...@lists.ceph.commailto:ceph-users-boun...@lists.ceph.com 
on behalf of Noah Mehl 
noahm...@combinedpublic.commailto:noahm...@combinedpublic.com
Date: Tuesday, July 28, 2015 at 7:53 AM
To: ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com 
ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com
Subject: [ceph-users] Updating OSD Parameters

When we update the following in ceph.conf:

[osd]
  osd_recovery_max_active = 1
  osd_max_backfills = 1

How do we make sure it takes effect?  Do we have to restart all of the ceph 
OSDs and mons?

Thanks!

~Noah

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuring MemStore in Ceph

2015-07-28 Thread Haomai Wang
Which version do you use?

https://github.com/ceph/ceph/commit/c60f88ba8a6624099f576eaa5f1225c2fcaab41a
should fix your problem

On Wed, Jul 29, 2015 at 5:44 AM, Aakanksha Pudipeddi-SSI
aakanksha...@ssi.samsung.com wrote:
 Hello,



 I am trying to setup a ceph cluster with a memstore backend. The problem is,
 it is always created with a fixed size (1GB). I made changes to the
 ceph.conf file as follows:



 osd_objectstore = memstore

 memstore_device_bytes = 5*1024*1024*1024



 The resultant cluster still has 1GB allocated to it. Could anybody point out
 what I am doing wrong here?



 Thanks,

 Aakanksha


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuring MemStore in Ceph

2015-07-28 Thread Aakanksha Pudipeddi-SSI
Hello Haomai,

I am using v0.94.2.

Thanks,
Aakanksha

-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com] 
Sent: Tuesday, July 28, 2015 7:20 PM
To: Aakanksha Pudipeddi-SSI
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Configuring MemStore in Ceph

Which version do you use?

https://github.com/ceph/ceph/commit/c60f88ba8a6624099f576eaa5f1225c2fcaab41a
should fix your problem

On Wed, Jul 29, 2015 at 5:44 AM, Aakanksha Pudipeddi-SSI 
aakanksha...@ssi.samsung.com wrote:
 Hello,



 I am trying to setup a ceph cluster with a memstore backend. The 
 problem is, it is always created with a fixed size (1GB). I made 
 changes to the ceph.conf file as follows:



 osd_objectstore = memstore

 memstore_device_bytes = 5*1024*1024*1024



 The resultant cluster still has 1GB allocated to it. Could anybody 
 point out what I am doing wrong here?



 Thanks,

 Aakanksha


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which kernel version can help avoid kernel client deadlock

2015-07-28 Thread van
Hi, Ilya,
  
  In the dmesg there are also a lot of libceph socket errors, which I think may 
be caused by my stopping the ceph service without unmapping the rbd.
  
  Here is a longer log with more info: http://jmp.sh/NcokrfT
  
  Thanks for willing to help.

van
chaofa...@owtware.com



 On Jul 28, 2015, at 7:11 PM, Ilya Dryomov idryo...@gmail.com wrote:
 
 On Tue, Jul 28, 2015 at 11:19 AM, van chaofa...@owtware.com 
 mailto:chaofa...@owtware.com wrote:
 Hi, Ilya,
 
  Thanks for your quick reply.
 
  Here is the link http://ceph.com/docs/cuttlefish/faq/ 
 http://ceph.com/docs/cuttlefish/faq/  , under the HOW
 CAN I GIVE CEPH A TRY?” section which talk about the old kernel stuff.
 
  By the way, what’s the main reason of using kernel 4.1, is there a lot of
 critical bugs fixed in that version despite perf improvements?
  I am worrying kernel 4.1 is too new that may introduce other problems.
 
 Well, I'm not sure what exactly is in 3.10.0.229, so I can't tell you
 off hand.  I can think of one important memory pressure related fix
 that's probably not in there.
 
 I'm suggesting the latest stable version of 4.1 (currently 4.1.3),
 because if you hit a deadlock (remember, this is a configuration that
 is neither recommended nor guaranteed to work), it'll be easier to
 debug and fix if the fix turns out to be worth it.
 
 If 4.1 is not acceptable for you, try the latest stable version of 3.18
 (that is 3.18.19).  It's an LTS kernel, so that should mitigate some of
 your concerns.
 
  And if I’m using the librdb API, is the kernel version matters?
 
 No, not so much.
 
 
  In my tests, I built a 2-nodes cluster, each with only one OSD with os
 centos 7.1, kernel version 3.10.0.229 and ceph v0.94.2.
  I created several rbds and mkfs.xfs on those rbds to create filesystems.
 (kernel client were running on the ceph cluster)
  I performed heavy IO tests on those filesystems and found some fio got
 hung and turned into D state forever (uninterruptible sleep).
  I suspect it’s the deadlock that make the fio process hung.
  However the ceph-osd are stil responsive, and I can operate rbd via librbd
 API.
  Does this mean it’s not the loopback mount deadlock that cause the fio
 process hung?
  Or it is also a deadlock phnonmenon, only one thread is blocked in memory
 allocation and other threads are still possible to receive API requests, so
 the ceph-osd are still responsive?
 
  What worth mentioning is that after I restart the ceph-osd daemon, all
 processes in D state come back into normal state.
 
  Below is related log in kernel:
 
 Jul  7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more
 than 120 seconds.
  Jul  7 02:25:39 node0 kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Jul  7 02:25:39 node0 kernel: xfsaild/rbd1D 880c2fc13680 0 24795
 2 0x0080
 Jul  7 02:25:39 node0 kernel: 8801d6343d40 0046
 8801d6343fd8 00013680
 Jul  7 02:25:39 node0 kernel: 8801d6343fd8 00013680
 880c0c0b 880c0c0b
 Jul  7 02:25:39 node0 kernel: 880c2fc14340 0001
  8805bace2528
 Jul  7 02:25:39 node0 kernel: Call Trace:
 Jul  7 02:25:39 node0 kernel: [81609e39] schedule+0x29/0x70
 Jul  7 02:25:39 node0 kernel: [a03a1890]
 _xfs_log_force+0x230/0x290 [xfs]
 Jul  7 02:25:39 node0 kernel: [810a9620] ? wake_up_state+0x20/0x20
 Jul  7 02:25:39 node0 kernel: [a03a1916] xfs_log_force+0x26/0x80
 [xfs]
 Jul  7 02:25:39 node0 kernel: [a03a6390] ?
 xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
 Jul  7 02:25:39 node0 kernel: [a03a64e1] xfsaild+0x151/0x5e0 [xfs]
 Jul  7 02:25:39 node0 kernel: [a03a6390] ?
 xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
 Jul  7 02:25:39 node0 kernel: [8109739f] kthread+0xcf/0xe0
 Jul  7 02:25:39 node0 kernel: [810972d0] ?
 kthread_create_on_node+0x140/0x140
 Jul  7 02:25:39 node0 kernel: [8161497c] ret_from_fork+0x7c/0xb0
 Jul  7 02:25:39 node0 kernel: [810972d0] ?
 kthread_create_on_node+0x140/0x140
 Jul  7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more
 than 120 seconds.
 
 Is that all there is in dmesg?  Can you paste the entire dmesg?
 
 Thanks,
 
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which kernel version can help avoid kernel client deadlock

2015-07-28 Thread Ilya Dryomov
On Tue, Jul 28, 2015 at 9:17 AM, van chaofa...@owtware.com wrote:
 Hi, list,

   I found in the ceph FAQ that the ceph kernel client should not run on
 machines belonging to the ceph cluster.
   As the ceph FAQ mentioned: “In older kernels, Ceph can deadlock if you try to
 mount CephFS or RBD client services on the same host that runs your test
 Ceph cluster. This is not a Ceph-related issue.”
   Here it says that there will be a deadlock when using an old kernel version.
   I wonder if anyone knows which newer kernel versions solve this loopback
 mount deadlock.
   It would be a great help, since I do need to use the rbd kernel client on the
 ceph cluster.

Note that doing this is *not* recommended.  That said, if you don't
push your system to its knees too hard, it should work.  I'm not sure
what exactly constitutes an older kernel as per that FAQ (as you
haven't even linked it), but even if I knew, I'd still suggest 4.1.


   As I searched for more information, I found two articles,
 https://lwn.net/Articles/595652/ and https://lwn.net/Articles/596618/, which talk
 about supporting nfs loopback mounts. It seems the effort was not only on memory
 management, but also on nfs-related code. I wonder if ceph has also made
 some effort on the kernel client to solve this problem. If ceph did, could
 anyone help provide the kernel version with the patch?

There wasn't any specific effort on the ceph side, but we do try not to
break it: sometime around 3.18 a ceph patch was merged that made it
impossible to co-locate the kernel client with OSDs; once we realized
that, the culprit patch was reverted and the revert was backported.

So the bottom line is we don't recommend it, but we try not to break
your ability to do it ;)

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-07-28 Thread Gregory Farnum
On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke
burkhard.li...@computational.bio.uni-giessen.de wrote:
 Hi,

 On 07/27/2015 05:42 PM, Gregory Farnum wrote:

 On Mon, Jul 27, 2015 at 4:33 PM, Burkhard Linke
 burkhard.li...@computational.bio.uni-giessen.de wrote:

 Hi,

 the nfs-ganesha documentation states:

 ... This FSAL links to a modified version of the CEPH library that has
 been
 extended to expose its distributed cluster and replication facilities to
 the
 pNFS operations in the FSAL. ... The CEPH library modifications have not
 been merged into the upstream yet. 

 (https://github.com/nfs-ganesha/nfs-ganesha/wiki/Fsalsupport#ceph)

 Is this still the case with the hammer release?

 The FSAL has been upstream for quite a while, but it's not part of our
 regular testing yet and I'm not sure what it gets from the Ganesha
 side. I'd encourage you to test it, but be wary — we had a recent
 report of some issues we haven't been able to set up to reproduce yet.

 Can you give some details on those issues? I'm currently looking for a way to
 provide NFS-based access to CephFS for our desktop machines.

Ummm...sadly I can't; we don't appear to have any tracker tickets and
I'm not sure where the report went to. :( I think it was from
Haomai...
-Greg


 The kernel NFS implementation in Ubuntu had some problems with CephFS in our
 setup, which I was not able to resolve yet. Ganesha seems to be more
 promising, since it uses libcephfs directly and does not need a mountpoint
 of its own.

 Best regards,
 Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-07-28 Thread Haomai Wang
On Tue, Jul 28, 2015 at 4:47 PM, Gregory Farnum g...@gregs42.com wrote:
 On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke
 burkhard.li...@computational.bio.uni-giessen.de wrote:
 Hi,

 On 07/27/2015 05:42 PM, Gregory Farnum wrote:

 On Mon, Jul 27, 2015 at 4:33 PM, Burkhard Linke
 burkhard.li...@computational.bio.uni-giessen.de wrote:

 Hi,

 the nfs-ganesha documentation states:

 ... This FSAL links to a modified version of the CEPH library that has
 been
 extended to expose its distributed cluster and replication facilities to
 the
 pNFS operations in the FSAL. ... The CEPH library modifications have not
 been merged into the upstream yet. 

 (https://github.com/nfs-ganesha/nfs-ganesha/wiki/Fsalsupport#ceph)

 Is this still the case with the hammer release?

 The FSAL has been upstream for quite a while, but it's not part of our
 regular testing yet and I'm not sure what it gets from the Ganesha
 side. I'd encourage you to test it, but be wary — we had a recent
 report of some issues we haven't been able to set up to reproduce yet.

 Can you give some details on those issues? I'm currently looking for a way to
 provide NFS-based access to CephFS for our desktop machines.

 Ummm...sadly I can't; we don't appear to have any tracker tickets and
 I'm not sure where the report went to. :( I think it was from
 Haomai...

My fault, I should have reported this as a ticket.

I have forgotten the details of the problem; I submitted the info to IRC :-(

It was related to the ls output. It would print the wrong user/group
owner as -1, maybe related to root squash?

 -Greg


 The kernel NFS implementation in Ubuntu had some problems with CephFS in our
 setup, which I was not able to resolve yet. Ganesha seems to be more
 promising, since it uses libcephfs directly and does not need a mountpoint
 of its own.

 Best regards,
 Burkhard
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Johannes Formann
The speed is divided because it's fair :)
You have reached the limit your hardware (I guess the SSDs) can deliver.

For 2 clients each doing 1200 MB/s you’ll basically have to double the number
of OSDs.
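
A rough back-of-the-envelope sketch, using only the numbers already in this
thread (30 HDDs at ~90 MB/s sustained, size=2, journals on the SSDs); treat it
as an estimate, not a sizing rule:

    30 HDDs * 90 MB/s / 2 copies           ~= 1350 MB/s aggregate (about what you measure)
    target: 2 clients * 1200 MB/s           = 2400 MB/s
    2400 MB/s * 2 copies / 90 MB/s per HDD ~= 54 HDDs, i.e. roughly double the OSD count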

greetings

Johannes

 Am 28.07.2015 um 11:56 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
 
 Hi,
 
 But my question is why the speed is divided between clients?
 And how many OSD nodes, OSD daemons, and PGs do I have to add/remove to ceph
 so that each cephfs-client could write at its max network speed (10Gbit/s ~
 1.2GB/s)???
 
 
 
 From: Johannes Formann mlm...@formann.de
 Sent: Tuesday, July 28, 2015 12:46 PM
 To: Shneur Zalman Mattern
 Subject: Re: [ceph-users] Did maximum performance reached?
 
 Hi,
 
 size=3 would decrease your performance. But with size=2 your results are not 
 bad too:
 Math:
 size=2 means each write is written 4 times (2 copies, first journal, later 
 disk). Calculating with 1,300 MB/s of „Client“ bandwidth, that means:
 
 2 (size) * 1300 MB/s / 6 (SSD) = 433MB/s each SSD
 2 (size) * 1300 MB/s / 30 (HDD) = 87MB/s each HDD
 
 
 greetings
 
 Johannes
 
 Am 28.07.2015 um 11:41 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
 
 Hi, Johannes (that's my grandpa's name)
 
 The size is 2; do you really think that the number of replicas can increase 
 performance?
 on the  http://ceph.com/docs/master/architecture/
 written Note: Striping is independent of object replicas. Since CRUSH 
 replicates objects across OSDs, stripes get replicated automatically. 
 
 OK, I'll check it,
 Regards, Shneur
 
 From: Johannes Formann mlm...@formann.de
 Sent: Tuesday, July 28, 2015 12:09 PM
 To: Shneur Zalman Mattern
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Did maximum performance reached?
 
 Hello,
 
 what is the „size“ parameter of your pool?
 
 Some math do show the impact:
 size=3 means each write is written 6 times (3 copies, first journal, later 
 disk). Calculating with 1,300 MB/s of „Client“ bandwidth, that means:
 
 3 (size) * 1300 MB/s / 6 (SSD) = 650MB/s per SSD
 3 (size) * 1300 MB/s / 30 (HDD) = 130MB/s per HDD
 
 If you use size=3, the results are as good as one can expect. (Even with 
 size=2 the results won’t be bad)
 
 greetings
 
 Johannes
 
 Am 28.07.2015 um 10:53 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
 
 We've built Ceph cluster:
   3 mon nodes (one of them is combined with mds)
   3 osd nodes (each one have 10 osd + 2 ssd for journaling)
   switch 24 ports x 10G
   10 gigabit - for public network
   20 gigabit bonding - between osds
   Ubuntu 12.04.05
   Ceph 0.87.2
 -
 Clients has:
   10 gigabit for ceph-connection
   CentOS 6.6 with kernel 3.19.8 equipped by cephfs-kmodule
 
 
 
 == fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 
 ===
 Single client:
 
 
 Starting 16 processes
 
 .below is just 1 job info
 trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 
 28 13:26:24 2015
 write: io=10240MB, bw=78656KB/s, iops=76 , runt=133312msec
   slat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
   clat (usec): min=1 , max=68 , avg= 3.61, stdev= 1.99
lat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
   clat percentiles (usec):
|  1.00th=[1],  5.00th=[2], 10.00th=[2], 20.00th=[2],
| 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
| 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
| 99.00th=[9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
| 99.99th=[   62]
   bw (KB/s)  : min=35790, max=318215, per=6.31%, avg=78816.91, 
 stdev=26397.76
   lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
   lat (usec) : 100=0.03%
 cpu  : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
 
 ...what's above repeats 16 times...
 
 Run status group 0 (all jobs):
 WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, 
 mint=133312msec, maxt=134329msec
 
 +
 Two clients:
 +
 below is just 1 job info
 trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 
 28 14:05:59 2015
 write: io=10240MB, bw=43154KB/s, iops=42 , runt=242984msec
   slat (usec): min=991 , max=285653 , avg=23716.12, stdev=23960.60
   clat (usec): min=1 , max=65 , avg= 3.67, stdev= 2.02
lat (usec): min=994 , max=285664 , avg=23723.39, stdev=23962.22
   clat percentiles (usec):
|  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
| 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
 

Re: [ceph-users] OSD RAM usage values

2015-07-28 Thread Mark Nelson



On 07/17/2015 07:50 AM, Gregory Farnum wrote:

On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:

Hi all,

I've read in the documentation that OSDs use around 512MB on a healthy
cluster.(http://ceph.com/docs/master/start/hardware-recommendations/#ram)
Now, our OSDs are all using around 2GB of RAM while the cluster is
healthy.


   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
29784 root  20   0 6081276 2.535g   4740 S   0.7  8.1   1346:55 ceph-osd
32818 root  20   0 5417212 2.164g  24780 S  16.2  6.9   1238:55 ceph-osd
25053 root  20   0 5386604 2.159g  27864 S   0.7  6.9   1192:08 ceph-osd
33875 root  20   0 5345288 2.092g   3544 S   0.7  6.7   1188:53 ceph-osd
30779 root  20   0 5474832 2.090g  28892 S   1.0  6.7   1142:29 ceph-osd
22068 root  20   0 5191516 2.000g  28664 S   0.7  6.4  31:56.72 ceph-osd
34932 root  20   0 5242656 1.994g   4536 S   0.3  6.4   1144:48 ceph-osd
26883 root  20   0 5178164 1.938g   6164 S   0.3  6.2   1173:01 ceph-osd
31796 root  20   0 5193308 1.916g  27000 S  16.2  6.1 923:14.87 ceph-osd
25958 root  20   0 5193436 1.901g   2900 S   0.7  6.1   1039:53 ceph-osd
27826 root  20   0 5225764 1.845g   5576 S   1.0  5.9   1031:15 ceph-osd
36011 root  20   0 5111660 1.823g  20512 S  15.9  5.8   1093:01 ceph-osd
19736 root  20   0 2134680 0.994g  0 S   0.3  3.2  46:13.47 ceph-osd



[root@osd003 ~]# ceph status
2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following dangerous
and experimental features are enabled: keyvaluestore
2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following dangerous
and experimental features are enabled: keyvaluestore
 cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
  health HEALTH_OK
  monmap e1: 3 mons at
{mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}
 election epoch 58, quorum 0,1,2 mds01,mds02,mds03
  mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby
  osdmap e25542: 258 osds: 258 up, 258 in
   pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects
 270 TB used, 549 TB / 819 TB avail
 4152 active+clean
8 active+clean+scrubbing+deep


We are using erasure code on most of our OSDs, so maybe that is a reason.
But also the cache-pool filestore OSDS on 200GB SSDs are using 2GB of RAM.
Our erasure code pool (16*14 osds) has a pg_num of 2048; our cache pool
(2*14 OSDs) has a pg_num of 1024.

Are these normal values for this configuration, and is the documentation a
bit outdated, or should we look into something else?


2GB of RSS is larger than I would have expected, but not unreasonable.
In particular I don't think we've gathered numbers on either EC pools
or on the effects of the caching processes.


FWIW, here are statistics for ~36 ceph-osds on the wip-promote-prob branch 
after several hours of cache tiering tests (30 OSD base, 6 OSD cache 
tier) using an EC6+2 pool.  At the time of this test, 4K random 
read/writes were being performed.  The cache tier OSDs specifically use 
quite a bit more memory than the base tier.  Interestingly, in this test 
major pagefaults are showing up for the cache tier OSDs, which is 
annoying. I may need to tweak kernel VM settings on this box.



# PROCESS SUMMARY (counters are /sec)
#Time  PID  User PR  PPID THRD S   VSZ   RSS CP  SysT  UsrT Pct  
AccuTime  RKB  WKB MajF MinF Command
09:58:48   715  root 20 1  424 S1G  271M  8  0.19  0.43   6  
30:12.64000 2502 /usr/local/bin/ceph-osd
09:58:48  1363  root 20 1  424 S1G  325M  8  0.14  0.33   4  
26:50.54000   68 /usr/local/bin/ceph-osd
09:58:48  2080  root 20 1  420 S1G  276M  1  0.21  0.49   7  
23:49.36000 2848 /usr/local/bin/ceph-osd
09:58:48  2747  root 20 1  424 S1G  283M  8  0.25  0.68   9  
25:16.63000 1391 /usr/local/bin/ceph-osd
09:58:48  3451  root 20 1  424 S1G  331M  6  0.13  0.14   2  
27:36.71000  148 /usr/local/bin/ceph-osd
09:58:48  4172  root 20 1  424 S1G  301M  6  0.19  0.43   6  
29:44.56000 2165 /usr/local/bin/ceph-osd
09:58:48  4935  root 20 1  420 S1G  310M  9  0.18  0.28   4  
29:09.78000 2042 /usr/local/bin/ceph-osd
09:58:48  5750  root 20 1  420 S1G  267M  2  0.11  0.14   2  
26:55.31000  866 /usr/local/bin/ceph-osd
09:58:48  6544  root 20 1  424 S1G  299M  7  0.22  0.62   8  
26:46.35000 3468 /usr/local/bin/ceph-osd
09:58:48  7379  root 20 1  424 S1G  283M  8  0.16  0.47   6  
25:47.86000  538 /usr/local/bin/ceph-osd
09:58:48  8183  root 20 1  424 S1G  269M  4  0.25  0.67   9  
35:09.85000 2968 /usr/local/bin/ceph-osd
09:58:48  9026  root 20 1  424 S1G  261M  1  0.19  0.46   6  
26:27.36000  

Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-28 Thread SCHAER Frederic
Hi again,

So I have tried 
- changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
- changing the memory configuration, from advanced ecc mode to performance 
mode, boosting the memory bandwidth from 35GB/s to 40GB/s
- plugged a second 10GB/s link and setup a ceph internal network
- tried various tuned-adm profile such as throughput-performance

This changed about nothing.

If 
- the CPUs are not maxed out, and lowering the frequency doesn't change a thing
- the network is not maxed out
- the memory doesn't seem to have an impact
- network interrupts are spread across all 8 cpu cores and receive queues are OK
- disks are not used at their maximum potential (iostat shows my dd commands 
produce much more tps than the 4MB ceph transfers...)

Where can I possibly find a bottleneck ?

I'm /(almost) out of ideas/ ... :'(

Regards

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of SCHAER 
Frederic
Sent: Friday, 24 July 2015 16:04
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: [PROVENANCE INTERNET] Re: [ceph-users] Ceph 0.94 (and lower) 
performance on 1 hosts ??

Hi,

Thanks.
I did not know about atop, nice tool... and I don't seem to be IRQ overloaded - 
I can reach 100% cpu % for IRQs, but that's shared across all 8 physical cores.
I also discovered turbostat which showed me the R510s were not configured for 
performance in the bios (but dbpm - demand based power management), and were 
not bumping the CPUs frequency to 2.4GHz as they should... only apparently 
remaining at 1.6Ghz...

But changing that did not improve things unfortunately. I now have CPUs using 
their Xeon turbo frequency, but no throughput improvement.

Looking at RPS/ RSS, it looks like our Broadcom cards are configured correctly 
according to redhat, i.e : one receive queue per physical core, spreading the 
IRQ load everywhere.
One thing I noticed though is that the dell BIOS allows to change IRQs... but 
once you change the network card IRQ, it also changes the RAID card IRQ as well 
as many others, all sharing the same bios IRQ (that's therefore apparently a 
useless option). Weird.

Still attempting to determine the bottleneck ;)

Regards
Frederic

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: Thursday, 23 July 2015 14:18
To: ceph-users@lists.ceph.com
Cc: Gregory Farnum; SCHAER Frederic
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:

 Your note that dd can do 2GB/s without networking makes me think that
 you should explore that. As you say, network interrupts can be
 problematic in some systems. The only thing I can think of that's been
 really bad in the past is that some systems process all network
 interrupts on cpu 0, and you probably want to make sure that it's
 splitting them across CPUs.


An IRQ overload would be very visible with atop.

Splitting the IRQs will help, but it is likely to need some smarts.

As in, irqbalance may spread things across NUMA nodes.

A card with just one IRQ line will need RPS (Receive Packet Steering),
irqbalance can't help it.

For example, I have a compute node with such a single line card and Quad
Opterons (64 cores, 8 NUMA nodes).

The default is all interrupt handling on CPU0 and that is very little,
except for eth2. So this gets a special treatment:
---
echo 4 > /proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default

---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with. At peak times it needs more than 2
cores, otherwise with this architecture just using 4 and 5 (same L2 cache)
would be better.
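
To sanity-check this kind of setup (a sketch, assuming the NIC is eth2 and
IRQ 106 as in the example above; the numbers will differ on your hardware):

grep eth2 /proc/interrupts                      # per-CPU interrupt counts for the NIC
cat /proc/irq/106/smp_affinity_list             # confirm the IRQ pinning
cat /sys/class/net/eth2/queues/rx-0/rps_cpus    # confirm the RPS CPU mask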

Regards,

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating OSD Parameters

2015-07-28 Thread Wido den Hollander


On 28-07-15 16:53, Noah Mehl wrote:
 When we update the following in ceph.conf:
 
 [osd]
   osd_recovery_max_active = 1
   osd_max_backfills = 1
 
 How do we make sure it takes effect?  Do we have to restart all of the
 ceph osds and mons?

On a client with client.admin keyring you execute:

ceph tell osd.* injectargs '--osd_recovery_max_active=1'

It will take effect immediately. Keep in mind though that PGs which are
currently recovering are not affected.

So if an OSD is currently doing 10 backfills, it will keep doing that. It
however won't accept any new backfills. So it slowly goes down to 9, 8,
7, etc, until you see only 1 backfill active.

Same goes for recovery.
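
For example (a sketch, assuming it is run from a node with the client.admin
keyring; the second command assumes it runs on the host of osd.0):

ceph tell osd.* injectargs '--osd_recovery_max_active=1 --osd_max_backfills=1'
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active'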

Wido

 
 Thanks!
 
 ~Noah
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Updating OSD Parameters

2015-07-28 Thread Noah Mehl
When we update the following in ceph.conf:

[osd]
  osd_recovery_max_active = 1
  osd_max_backfills = 1

How do we make sure it takes effect?  Do we have to restart all of the ceph 
osds and mons?

Thanks!

~Noah

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating OSD Parameters

2015-07-28 Thread Noah Mehl
Wido,

That’s awesome, I will look at this right now.

Thanks!

~Noah

 On Jul 28, 2015, at 11:02 AM, Wido den Hollander w...@42on.com wrote:
 
 
 
 On 28-07-15 16:53, Noah Mehl wrote:
 When we update the following in ceph.conf:
 
 [osd]
  osd_recovery_max_active = 1
  osd_max_backfills = 1
 
 How do we make sure it takes effect?  Do we have to restart all of the
 ceph osds and mons?
 
 On a client with client.admin keyring you execute:
 
 ceph tell osd.* injectargs '--osd_recovery_max_active=1'
 
 It will take effect immediately. Keep in mind though that PGs which are
 currently recovering are not affected.
 
 So if an OSD is currently doing 10 backfills, it will keep doing that. It
 however won't accept any new backfills. So it slowly goes down to 9, 8,
 7, etc, until you see only 1 backfill active.
 
 Same goes for recovery.
 
 Wido
 
 
 Thanks!
 
 ~Noah
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why are there degraded PGs when adding OSDs?

2015-07-28 Thread Samuel Just
If it wouldn't be too much trouble, I'd actually like the binary osdmap as well 
(it contains the crushmap, but also a bunch of other stuff).  There is a 
command that lets you get old osdmaps from the mon by epoch as long as they 
haven't been trimmed.
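
For reference, something along these lines should work (a sketch; substitute the
epoch you are interested in):

ceph osd getmap <epoch> -o osdmap.<epoch>    # binary osdmap for that epoch
osdmaptool --print osdmap.<epoch>            # optional sanity check of the dump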
-Sam

- Original Message -
From: Chad William Seys cws...@physics.wisc.edu
To: Samuel Just sj...@redhat.com
Cc: ceph-users ceph-us...@ceph.com
Sent: Tuesday, July 28, 2015 7:40:31 AM
Subject: Re: [ceph-users] why are there degraded PGs when adding OSDs?

Hi Sam,

Trying again today with crush tunables set to firefly.  Degraded peaked around 
46.8%.

I've attached the ceph pg dump and the crushmap (same as osdmap) from before 
and after the OSD additions. 3 osds were added on host osd03.  This added 5TB 
to about 17TB for a total of around 22TB.  5TB/22TB = 22.7%  Is it expected 
for 46.8% of PGs to be degraded after adding 22% of the storage?

Another weird thing is that the kernel RBD clients froze up after the OSDs 
were added, but worked fine after reboot.  (Debian kernel 3.16.7)

Thanks for checking!
C.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which kernel version can help avoid kernel client deadlock

2015-07-28 Thread van

 On Jul 28, 2015, at 7:57 PM, Ilya Dryomov idryo...@gmail.com wrote:
 
 On Tue, Jul 28, 2015 at 2:46 PM, van chaofa...@owtware.com wrote:
 Hi, Ilya,
 
  In the dmesg, there are also a lot of libceph socket errors, which I think
 may be caused by my stopping the ceph service without unmapping the rbd.
 
 Well, sure enough, if you kill all OSDs, the filesystem mounted on top
 of rbd device will get stuck.

Sure it will get stuck if the osds are stopped. And since rados requests have a 
retry policy, the stuck requests will recover after I start the daemons again.

But in my case, the osds are running in a normal state and the librbd API can 
read/write normally.
Meanwhile, a heavy fio test on the filesystem mounted on top of the rbd device 
will get stuck.

I wonder if this phenomenon is triggered by running the rbd kernel client on 
machines that have ceph daemons, i.e. the annoying loopback mount deadlock issue.

In my opinion, if it’s due to the loopback mount deadlock, the OSDs will become 
unresponsive, no matter whether the requests come from user space (like the API) 
or from the kernel client.
Am I right?

If so, my case seems to be triggered by another bug.

Anyway, it seems that I should separate the clients and daemons at least.

Thanks.

 
 Thanks,
 
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread John Spray



On 28/07/15 11:17, Shneur Zalman Mattern wrote:

Oh, now I have to cry :-)
not because it's not SSDs... it's SAS2 HDDs

Because, I need to build something for 140 clients... 4200 OSDs

:-(

Looks like I can pick up my performance with SSDs, but I need a huge capacity ~ 
2PB
Perhaps a cache tiering pool can save my money, but I've read here that it's 
slower than people think...

:-(

Why is Lustre more performant? They're the same HDDs?


Lustre isn't (A) creating two copies of your data, and it's (B) not 
executing disk writes as atomic transactions (i.e. no data writeahead log).


The A tradeoff is that while a Lustre system typically requires an 
expensive dual ported RAID controller, Ceph doesn't.  You take the money 
you saved on RAID controllers and spend it on having a larger number of 
cheaper hosts and drives.  If you've already bought the Lustre-oriented 
hardware then my advice would be to run Lustre on it :-)


The efficient way of handling B is to use SSD journals for your OSDs.  
Typical Ceph servers have one SSD per approx 4 OSDs.
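
As a minimal sketch of what that looks like (device names and the journal size
are only examples, not a recommendation for your hardware): put each OSD's
journal on a partition of a shared SSD when the OSD is created, e.g.

ceph-disk prepare /dev/sdc /dev/sda     # data on the HDD, journal partition on the SSD

with something like this in ceph.conf:

[osd]
    osd journal size = 10240            # 10 GB journal partitions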


John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread John Spray



On 28/07/15 11:53, John Spray wrote:



On 28/07/15 11:17, Shneur Zalman Mattern wrote:

Oh, now I have to cry :-)
not because it's not SSDs... it's SAS2 HDDs

Because, I need to build something for 140 clients... 4200 OSDs

:-(

Looks like I can pick up my performance with SSDs, but I need a huge 
capacity ~ 2PB
Perhaps a cache tiering pool can save my money, but I've read here 
that it's slower than people think...


:-(

Why is Lustre more performant? They're the same HDDs?


Lustre isn't (A) creating two copies of your data, and it's (B) not 
executing disk writes as atomic transactions (i.e. no data writeahead 
log).


The A tradeoff is that while a Lustre system typically requires an 
expensive dual ported RAID controller, Ceph doesn't.  You take the 
money you saved on RAID controllers and spend it on having a larger 
number of cheaper hosts and drives.  If you've already bought the 
Lustre-oriented hardware then my advice would be to run Lustre on it :-)


The efficient way of handling B is to use SSD journals for your OSDs.  
Typical Ceph servers have one SSD per approx 4 OSDs.


Oh, I've just re-read the original message in this thread, and you're 
already using SSD journals.


So I think the only point of confusion was that you weren't dividing 
your expected bandwidth number by the number of replicas, right?


 Each spindle disk can write ~100MB/s, and we have 10 SAS disks on 
each node = aggregated write speed is ~900MB/s (because of striping etc.)
And we have 3 OSD nodes, and objects are striped across all 30 osds - I 
thought it would also aggregate and we'd get something around 2.5 GB/s, 
but no...


Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 = 
1350MB/s -- so I think you're actually doing pretty well with your 
1367MB/s number.


John





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of cephfs with samba

2015-07-28 Thread Dzianis Kahanovich
I use cephfs over the samba vfs and have some issues.

1) If I use a stacked vfs (ceph & scannedonly) - I have problems with file
order, which is solved by the dirsort vfs (vfs objects = scannedonly dirsort ceph).
A single ceph vfs looks good too (and I use it on its own for fast internal shares),
but you can try adding dirsort (vfs objects = dirsort ceph).

2) I use 2 of my patches:
https://github.com/mahatma-kaganovich/raw/tree/master/app-portage/ppatch/files/extensions/net-fs/samba/compile
- to support max disk size and to secure chown. About chown: I am unsure about
strictly following standard system behaviour, but it works for me; without it, a user can
chown() even to root. I put the first (max disk size) patch into the samba bugzilla
some time ago; the second patch - no, I am unsure about its correctness, but sure about
the security hole.
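
For anyone wanting to try the ceph vfs module directly, a minimal smb.conf share
along these lines is a reasonable starting point (share name, path and the cephx
user are placeholders, adjust to your setup):

[cephfs]
    path = /
    vfs objects = dirsort ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    read only = no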

Jörg Henne writes:
 Hi all,
 
 the faq at http://ceph.com/docs/cuttlefish/faq/ mentions the possibility to
 export a mounted cephfs via samba. This combination exhibits a very
 weird behaviour, though.
 
 We have a directory on cephfs with many small xml snippets. If I repeatedly
 ls the directory on Unix, I get the same answer each and every time:
 
 root@gru:/mnt/regtest/regressiontestdata2/assets# while true; do ls|wc -l;
 sleep 1; done
 851
 851
 851
 ... and so on
 
 If I do the same on the directory exported and mounted via SMB under Windows
 the result looks like this (output generated under cygwin, but the effect is
 present with Windows Explorer as well):
 
 $ while true; do ls|wc -l; sleep 1; done
 380
 380
 380
 380
 380
 1451
 362
 851
 851
 851
 851
 851
 851
 851
 851
 1451
 362
 851
 851
 851
 ...
 
 The problem does not seem to be related to Samba. If I copy the files to an
 XFS volume and export that, things look fine.
 
 Thanks
 Joerg Henne
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which kernel version can help avoid kernel client deadlock

2015-07-28 Thread Ilya Dryomov
On Tue, Jul 28, 2015 at 11:19 AM, van chaofa...@owtware.com wrote:
 Hi, Ilya,

   Thanks for your quick reply.

   Here is the link http://ceph.com/docs/cuttlefish/faq/ , under the “HOW
 CAN I GIVE CEPH A TRY?” section, which talks about the old kernel stuff.

   By the way, what’s the main reason for using kernel 4.1? Are there a lot of
 critical bugs fixed in that version besides perf improvements?
   I am worried kernel 4.1 is too new and may introduce other problems.

Well, I'm not sure what exactly is in 3.10.0.229, so I can't tell you
off hand.  I can think of one important memory pressure related fix
that's probably not in there.

I'm suggesting the latest stable version of 4.1 (currently 4.1.3),
because if you hit a deadlock (remember, this is a configuration that
is neither recommended nor guaranteed to work), it'll be easier to
debug and fix if the fix turns out to be worth it.

If 4.1 is not acceptable for you, try the latest stable version of 3.18
(that is 3.18.19).  It's an LTS kernel, so that should mitigate some of
your concerns.

   And if I’m using the librbd API, does the kernel version matter?

No, not so much.


   In my tests, I built a 2-node cluster, each node with only one OSD, with OS
 CentOS 7.1, kernel version 3.10.0.229 and ceph v0.94.2.
   I created several rbds and ran mkfs.xfs on those rbds to create filesystems.
 (the kernel clients were running on the ceph cluster)
   I performed heavy IO tests on those filesystems and found that some fio
 processes hung in D state forever (uninterruptible sleep).
   I suspect it’s the deadlock that makes the fio processes hang.
   However, the ceph-osd daemons are still responsive, and I can operate rbd via
 librbd API.
   Does this mean it’s not the loopback mount deadlock that causes the fio
 processes to hang?
   Or is it also a deadlock phenomenon, where only one thread is blocked in memory
 allocation and other threads can still receive API requests, so
 the ceph-osd daemons are still responsive?

   What’s worth mentioning is that after I restart the ceph-osd daemons, all
 processes in D state come back to a normal state.

   Below is the related log from the kernel:

 Jul  7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more
 than 120 seconds.
 Jul  7 02:25:39 node0 kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Jul  7 02:25:39 node0 kernel: xfsaild/rbd1D 880c2fc13680 0 24795
 2 0x0080
 Jul  7 02:25:39 node0 kernel: 8801d6343d40 0046
 8801d6343fd8 00013680
 Jul  7 02:25:39 node0 kernel: 8801d6343fd8 00013680
 880c0c0b 880c0c0b
 Jul  7 02:25:39 node0 kernel: 880c2fc14340 0001
  8805bace2528
 Jul  7 02:25:39 node0 kernel: Call Trace:
 Jul  7 02:25:39 node0 kernel: [81609e39] schedule+0x29/0x70
 Jul  7 02:25:39 node0 kernel: [a03a1890]
 _xfs_log_force+0x230/0x290 [xfs]
 Jul  7 02:25:39 node0 kernel: [810a9620] ? wake_up_state+0x20/0x20
 Jul  7 02:25:39 node0 kernel: [a03a1916] xfs_log_force+0x26/0x80
 [xfs]
 Jul  7 02:25:39 node0 kernel: [a03a6390] ?
 xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
 Jul  7 02:25:39 node0 kernel: [a03a64e1] xfsaild+0x151/0x5e0 [xfs]
 Jul  7 02:25:39 node0 kernel: [a03a6390] ?
 xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
 Jul  7 02:25:39 node0 kernel: [8109739f] kthread+0xcf/0xe0
 Jul  7 02:25:39 node0 kernel: [810972d0] ?
 kthread_create_on_node+0x140/0x140
 Jul  7 02:25:39 node0 kernel: [8161497c] ret_from_fork+0x7c/0xb0
 Jul  7 02:25:39 node0 kernel: [810972d0] ?
 kthread_create_on_node+0x140/0x140
 Jul  7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more
 than 120 seconds.

Is that all there is in dmesg?  Can you paste the entire dmesg?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
As I understand it now, in this case (30 disks) the 10Gbit network is not the 
bottleneck!

With another HW config ( + 5 OSD nodes = + 50 disks ) I'd get ~3400 MB/s,
and 3 clients could work at full bandwidth, yes?

OK, let's try ! ! ! ! ! ! !

Perhaps, somebody has more suggestions for increasing performance:
1. NVMe journals, 
2. btrfs over osd
3. ssd-based osds,
4. 15K hdds 
5. RAID 10 on each OSD node
.
everybody - brainstorm!!!

John:
Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 =
1350MB/s -- so I think you're actually doing pretty well with your
1367MB/s number.











___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Udo Lembke
Hi,

On 28.07.2015 12:02, Shneur Zalman Mattern wrote:
 Hi!

 And so, in your math
 I need to build size = osd, 30 replicas for my cluster of 120TB - to get my 
 demands 
30 replicas is the wrong math! Fewer replicas = more speed (because of
less writing).
More replicas, less speed.
For data safety a replica count of 3 is recommended.
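
For reference, the replica count is a per-pool setting, e.g. (pool name is just
an example):

ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2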


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of cephfs with samba

2015-07-28 Thread Dzianis Kahanovich
PS: I started to use these patches with samba 4.1. IMHO some of the problems may (or
must) be solved not inside the vfs code, but outside - in samba itself, but I still
use both patches in samba 4.2.3 without verification.

Dzianis Kahanovich writes:
 I use cephfs over the samba vfs and have some issues.
 
 1) If I use a stacked vfs (ceph & scannedonly) - I have problems with file
 order, which is solved by the dirsort vfs (vfs objects = scannedonly dirsort ceph).
 A single ceph vfs looks good too (and I use it on its own for fast internal 
 shares),
 but you can try adding dirsort (vfs objects = dirsort ceph).
 
 2) I use 2 of my patches:
 https://github.com/mahatma-kaganovich/raw/tree/master/app-portage/ppatch/files/extensions/net-fs/samba/compile
 - to support max disk size and to secure chown. About chown: I am unsure about
 strictly following standard system behaviour, but it works for me; without it, a user 
 can
 chown() even to root. I put the first (max disk size) patch into the samba bugzilla
 some time ago; the second patch - no, I am unsure about its correctness, but sure about
 the security hole.
 
 Jörg Henne writes:
 Hi all,

 the faq at http://ceph.com/docs/cuttlefish/faq/ mentions the possibility to
 export a mounted cephfs via samba. This combination exhibits a very
 weird behaviour, though.

 We have a directory on cephfs with many small xml snippets. If I repeatedly
 ls the directory on Unix, I get the same answer each and every time:

 root@gru:/mnt/regtest/regressiontestdata2/assets# while true; do ls|wc -l;
 sleep 1; done
 851
 851
 851
 ... and so on

 If I do the same on the directory exported and mounted via SMB under Windows
 the result looks like this (output generated under cygwin, but the effect is
 present with Windows Explorer as well):

 $ while true; do ls|wc -l; sleep 1; done
 380
 380
 380
 380
 380
 1451
 362
 851
 851
 851
 851
 851
 851
 851
 851
 1451
 362
 851
 851
 851
 ...

 The problem does not seem to be related to Samba. If I copy the files to an
 XFS volume and export that, things look fine.

 Thanks
 Joerg Henne

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 
 


-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to create new pool in cluster

2015-07-28 Thread Daleep Bais
Dear Kefu,

Thanks..
It worked..

Appreciate your help..

TC

On Sun, Jul 26, 2015 at 8:06 AM, kefu chai tchai...@gmail.com wrote:

 On Sat, Jul 25, 2015 at 9:43 PM, Daleep Bais daleepb...@gmail.com wrote:
  Hi All,
 
  I am unable to create new pool in my cluster. I have some existing pools.
 
  I get error :
 
  ceph osd pool create fullpool 128 128
  Error EINVAL: crushtool: exec failed: (2) No such file or directory
 
 
  existing pools are :
 
  cluster# ceph osd lspools
  0 rbd,1 data,3 pspl,
 
  Please suggest..


 Daleep, it seems your crushtool is not in $PATH when the monitor started.
 You might want to make sure you have crushtool installed somewhere,
 and:

 $ ceph --admin-daemon <path-to-your-admin-socket> config show | grep
 crushtool   ## check the path to crushtool
 $ ceph tell mon.* injectargs --crushtool <path-to-your-crushtool>   ##
 point it to your crushtool


 HTH.

 --
 Regards
 Kefu Chai

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which kernel version can help avoid kernel client deadlock

2015-07-28 Thread Ilya Dryomov
On Tue, Jul 28, 2015 at 2:46 PM, van chaofa...@owtware.com wrote:
 Hi, Ilya,

   In the dmesg, there are also a lot of libceph socket errors, which I think
 may be caused by my stopping the ceph service without unmapping the rbd.

Well, sure enough, if you kill all OSDs, the filesystem mounted on top
of rbd device will get stuck.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which kernel version can help avoid kernel client deadlock

2015-07-28 Thread Ilya Dryomov
On Tue, Jul 28, 2015 at 7:20 PM, van chaofa...@owtware.com wrote:

 On Jul 28, 2015, at 7:57 PM, Ilya Dryomov idryo...@gmail.com wrote:

 On Tue, Jul 28, 2015 at 2:46 PM, van chaofa...@owtware.com wrote:
 Hi, Ilya,

  In the dmesg, there are also a lot of libceph socket errors, which I think
 may be caused by my stopping the ceph service without unmapping the rbd.

 Well, sure enough, if you kill all OSDs, the filesystem mounted on top
 of rbd device will get stuck.

 Sure it will get stuck if the osds are stopped. And since rados requests have 
 a retry policy, the stuck requests will recover after I start the daemons 
 again.

 But in my case, the osds are running in a normal state and the librbd API can 
 read/write normally.
 Meanwhile, a heavy fio test on the filesystem mounted on top of the rbd device 
 will get stuck.

 I wonder if this phenomenon is triggered by running the rbd kernel client on 
 machines that have ceph daemons, i.e. the annoying loopback mount deadlock issue.

 In my opinion, if it’s due to the loopback mount deadlock, the OSDs will 
 become unresponsive, no matter whether the requests come from user space (like 
 the API) or from the kernel client.
 Am I right?

Not necessarily.


 If so, my case seems to be triggered by another bug.

 Anyway, it seems that I should separate client and daemons at least.

Try 3.18.19 if you can.  I'd be interested in your results.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] which kernel version can help avoid kernel client deadlock

2015-07-28 Thread van
Hi, list,

  I found on the ceph FAQ that the ceph kernel client should not run on machines 
belonging to the ceph cluster.
  As the ceph FAQ mentioned, “In older kernels, Ceph can deadlock if you try to 
mount CephFS or RBD client services on the same host that runs your test Ceph 
cluster. This is not a Ceph-related issue.”
  Here it says that there will be a deadlock if using an old kernel version.
  I wonder if anyone knows which new kernel version solves this loopback mount 
deadlock.
  It will be a great help since I do need to use the rbd kernel client on the ceph 
cluster.
  
  As I searched for more information, I found two articles, 
https://lwn.net/Articles/595652/ and 
https://lwn.net/Articles/596618/, which talk about 
supporting nfs loopback mounts. It seems the effort was not only on memory management, 
but also on nfs-related code. I wonder if ceph has also made some effort 
on the kernel client to solve this problem. If ceph did, could anyone help provide 
the kernel version with the patch?
  
Thanks.


van
chaofa...@owtware.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RadosGW - radosgw-agent start error

2015-07-28 Thread Italo Santos
Hello everyone,  

I’m setting up a federated configuration of radosgw, but when I start a 
radosgw-agent I face the error below and I’d like to know if I’m doing 
something wrong…?

See the error:

root@cephgw0001:~# radosgw-agent -v -c /etc/ceph/radosgw-agent/default.conf
2015-07-28 17:02:03,103 3600 [radosgw_agent][INFO  ]  ____  
 __   ___  ___
2015-07-28 17:02:03,103 3600 [radosgw_agent][INFO  ] /__` \ / |\ | /  ` /\  
/ _` |__  |\ |  |
2015-07-28 17:02:03,104 3600 [radosgw_agent][INFO  ] .__/  |  | \| \__,/~~\ 
\__ |___ | \|  |
2015-07-28 17:02:03,104 3600 [radosgw_agent][INFO  ]
  v1.2.3
2015-07-28 17:02:03,105 3600 [radosgw_agent][INFO  ] agent options:
2015-07-28 17:02:03,105 3600 [radosgw_agent][INFO  ]  args:
2015-07-28 17:02:03,106 3600 [radosgw_agent][INFO  ]conf
  : None
2015-07-28 17:02:03,106 3600 [radosgw_agent][INFO  ]dest_access_key 
  : 
2015-07-28 17:02:03,107 3600 [radosgw_agent][INFO  ]dest_secret_key 
  : 
2015-07-28 17:02:03,108 3600 [radosgw_agent][INFO  ]destination 
  : http://tmk.object-storage.local:80
2015-07-28 17:02:03,108 3600 [radosgw_agent][INFO  ]incremental_sync_delay  
  : 30
2015-07-28 17:02:03,109 3600 [radosgw_agent][INFO  ]lock_timeout
  : 60
2015-07-28 17:02:03,109 3600 [radosgw_agent][INFO  ]log_file
  : /var/log/radosgw/radosgw-sync.log
2015-07-28 17:02:03,110 3600 [radosgw_agent][INFO  ]log_lock_time   
  : 20
2015-07-28 17:02:03,110 3600 [radosgw_agent][INFO  ]max_entries 
  : 1000
2015-07-28 17:02:03,111 3600 [radosgw_agent][INFO  ]metadata_only   
  : False
2015-07-28 17:02:03,111 3600 [radosgw_agent][INFO  ]num_workers 
  : 1
2015-07-28 17:02:03,112 3600 [radosgw_agent][INFO  ]object_sync_timeout 
  : 216000
2015-07-28 17:02:03,112 3600 [radosgw_agent][INFO  ]prepare_error_delay 
  : 10
2015-07-28 17:02:03,113 3600 [radosgw_agent][INFO  ]quiet   
  : False
2015-07-28 17:02:03,113 3600 [radosgw_agent][INFO  ]rgw_data_log_window 
  : 30
2015-07-28 17:02:03,114 3600 [radosgw_agent][INFO  ]source  
  : None
2015-07-28 17:02:03,114 3600 [radosgw_agent][INFO  ]src_access_key  
  : 
2015-07-28 17:02:03,115 3600 [radosgw_agent][INFO  ]src_secret_key  
  : 
2015-07-28 17:02:03,115 3600 [radosgw_agent][INFO  ]src_zone
  : None
2015-07-28 17:02:03,116 3600 [radosgw_agent][INFO  ]sync_scope  
  : incremental
2015-07-28 17:02:03,116 3600 [radosgw_agent][INFO  ]test_server_host
  : None
2015-07-28 17:02:03,117 3600 [radosgw_agent][INFO  ]test_server_port
  : 8080
2015-07-28 17:02:03,118 3600 [radosgw_agent][INFO  ]verbose 
  : True
2015-07-28 17:02:03,118 3600 [radosgw_agent][INFO  ]versioned   
  : False
2015-07-28 17:02:03,118 3600 [radosgw_agent.client][INFO  ] creating connection 
to endpoint: http://tmk.object-storage.local:80
2015-07-28 17:02:03,120 3600 [radosgw_agent][ERROR ] RegionMapError: Could not 
retrieve region map from destination: make_request() got an unexpected keyword 
argument 'params'


Regards.

Italo Santos
http://italosantos.com.br/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Configuring MemStore in Ceph

2015-07-28 Thread Aakanksha Pudipeddi-SSI
Hello,

I am trying to set up a ceph cluster with a memstore backend. The problem is, it 
is always created with a fixed size (1GB). I made changes to the ceph.conf file 
as follows:

osd_objectstore = memstore
memstore_device_bytes = 5*1024*1024*1024

The resultant cluster still has 1GB allocated to it. Could anybody point out 
what I am doing wrong here?
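
(For reference, the value the OSD actually parsed can be checked on the OSD host
via the admin socket, e.g., assuming the daemon is osd.0:

ceph daemon osd.0 config show | grep memstore

This at least shows whether the setting reached the daemon at all.)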

Thanks,
Aakanksha
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com