[ceph-users] Rados gateway / RBD access restrictions

2015-07-01 Thread Jacek Jarosiewicz

Hi,

I've been playing around with the rados gateway and RBD and have some 
questions about user access restrictions. I'd like to be able to set up 
a cluster that would be shared among different clients without any 
conflicts...


Is there a way to limit S3/Swift clients to be able to write data only 
to one bucket? Now S3 users can create their own buckets and as many as 
they want - it would be good to have some kind of control over what user 
can and can't do. I found this thread about namespaces:


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/033451.html

..but it's old and I was wondering if maybe the namespace feature is 
better documented now somewhere?


Another problem: is there a way to limit RBD clients to a certain 
bandwidth and/or iops they can use? So that one client can't disrupt 
another client's vm's for example?


Cheers,
J

--
Jacek Jarosiewicz
Administrator Systemów Informatycznych


SUPERMEDIA Sp. z o.o. z siedzibą w Warszawie
ul. Senatorska 13/15, 00-075 Warszawa
Sąd Rejonowy dla m.st.Warszawy, XII Wydział Gospodarczy Krajowego 
Rejestru Sądowego,

nr KRS 029537; kapitał zakładowy 42.756.000 zł
NIP: 957-05-49-503
Adres korespondencyjny: ul. Jubilerska 10, 04-190 Warszawa


SUPERMEDIA -   http://www.supermedia.pl
dostep do internetu - hosting - kolokacja - lacza - telefonia
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] file/directory invisible through ceph-fuse

2015-07-01 Thread flisky

On 2015-07-01 16:11, Gregory Farnum wrote:

On Wed, Jul 1, 2015 at 9:02 AM, flisky yinjif...@lianjia.com wrote:

Hi list,

I'm hitting a strange problem:

Sometimes I cannot see a file/directory created by another ceph-fuse
client. It becomes visible after I touch/mkdir the same name.

Any thoughts?


What version are you running? We've seen a few things like this with
older releases, although usually it's in the kernel...
-Greg



ceph-fuse: 0.94.1
kernel version: 2.6.32-431.el6.x86_64
FUSE library version: 2.8.3
FUSE kernel interface version 7.12

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bucket owner vs S3 ACL?

2015-07-01 Thread Florent MONTHEL
Hi Valery,

With the old account, did you try to give FULL access to the new user ID?

The process should be (see the command sketch below):
1. From the OLD account, grant FULL access to the NEW account (via S3 ACL, with CloudBerry for example).
2. With radosgw-admin, update the bucket link from the OLD account to the NEW account (the link lets the user see the bucket in a bucket list).
3. From the NEW account, remove the OLD account's FULL access (again via S3 ACL).
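
If you prefer command-line tools, the same steps with s3cmd would look roughly like this (bucket and user IDs are placeholders, and the exact ACL flags may vary between s3cmd versions):

# 1. as the OLD account, grant full control to the NEW user
s3cmd setacl s3://BUCKET_NAME --acl-grant=full_control:NEW_USER_ID

# 2. relink the bucket so it shows up in the NEW user's bucket list
radosgw-admin bucket link --uid=NEW_USER_ID --bucket=BUCKET_NAME

# 3. as the NEW account, revoke the OLD user's access
s3cmd setacl s3://BUCKET_NAME --acl-revoke=full_control:OLD_USER_ID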

Thanks


 On Jun 29, 2015, at 11:46 AM, Valery Tschopp valery.tsch...@switch.ch wrote:
 
 Hi guys,
 
 We use the radosgw (v0.80.9) with the Openstack Keystone integration.
 
 One project has been deleted, so now I have to transfer the ownership of all 
 the buckets to another user/project.
 
 Using radosgw-admin I have changed the owner:
 
 radosgw-admin bucket link --uid NEW_USER_ID --bucket BUCKET_NAME
 
 And the owner has been updated:
 
 radosgw-admin bucket stats --bucket BUCKET_NAME
 
 { "bucket": "BUCKET_NAME",
   "pool": ".rgw.buckets",
   "index_pool": ".rgw.buckets.index",
   "id": "default.4063334.17",
   "marker": "default.4063334.17",
   "owner": "NEW_USER_ID",
   "ver": 66301,
   "master_ver": 0,
   "mtime": 1435583681,
   "max_marker": "",
   "usage": { "rgw.main": { "size_kb": 189433890,
       "size_kb_actual": 189473684,
       "num_objects": 19043},
     "rgw.multimeta": { "size_kb": 0,
       "size_kb_actual": 0,
       "num_objects": 0}},
   "bucket_quota": { "enabled": false,
     "max_size_kb": -1,
     "max_objects": -1}
 }
 
 But the S3 ACL of this bucket is still referencing the old user/project (from 
 radosgw.log) when I try to access it with the new owner:
 
 2015-06-29 17:08:33.236265 7f40d8a76700 15 Read AccessControlPolicy
 <AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>OLD_USER_ID</ID><DisplayName>OLD_PROJECT_NAME</DisplayName></Owner><AccessControlList><Grant><Grantee
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>OLD_USER_ID</ID><DisplayName>OLD_PROJECT_NAME</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>
 
 
 Therefore I get a 403, because the S3 ACL still enforces the old owner, not 
 the new one.
 
 How can I update these S3 ACLs and fully transfer the ownership to the new 
 owner/project?
 
 Cheers,
 Valery
 
 
 
 -- 
 SWITCH
 --
 Valery Tschopp, Software Engineer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 email: valery.tsch...@switch.ch phone: +41 44 268 1544
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-07-01 Thread Tuomas Juntunen
Thanks Mark

Are there any plans for a ZFS-like L2ARC in Ceph, or is cache tiering what
should work like this in the future?

I have tested cache tier + EC pool, and that created too much load on our
servers, so it was not viable for us.

I was also wondering if EnhanceIO would be a good solution for getting more
random iops. I've read some of Sébastien's write-ups.

Br,
Tuomas


-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: 1. heinäkuuta 2015 20:29
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

On 07/01/2015 12:13 PM, Tuomas Juntunen wrote:
 Hi

 Yes, the OSD's are on spinning disks and we have 18 SSD's for journal, 
 one SSD for two OSD's

 The OSD's are:
 Model Family: Seagate Barracuda 7200.14 (AF)
 Device Model: ST2000DM001-1CH164

 What I've understood the journals are not used as read cache at all, 
 just for writing. Would SSD based cache pool be viable solution here?

Ok, so that makes more sense. The performance is still lower than expected
but maybe 3-4x rather than several orders of magnitude.  My guess is that
cache tiering in it's current form probably won't help you much unless you
have a workload that fits mostly into the cache.  The promotion penalty is
really high though so we likely will have to promote much more slowly than
we currently do.

Mark


 Br, T

 -Original Message-
 From: Mark Nelson [mailto:mnel...@redhat.com]
 Sent: 1. heinäkuuta 2015 13:58
 To: Tuomas Juntunen; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops



 On 06/30/2015 10:42 PM, Tuomas Juntunen wrote:
 Hi

 For seq reads here's the latencies:
   lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%
   lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50%
   lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19%

 Random reads:
   lat (usec) : 10=0.01%
   lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55%
   lat (msec) : 100=99.31%, 250=0.08%

 100msecs seems a lot to me.

 It is, but what's more interesting imho is that it's so consistent.  You
 don't have some ops completing fast and other ones completing slowly
holding
 everything up.  It's like the OSDs are simply overloaded with concurrent
IOs
 and everything is waiting.  Maybe I'm confused, are your OSDs on SSDs?
Are
 there spinning disks involved?  If so, what model(s)?

 You might want to use collectl -sD -oT on one of the OSD nodes during
the
 test and see what the IO to the disk looks like during random reads and
the
 especially with the svctime for the disks is like.

 Mark


 Br,T

 -Original Message-
 From: Mark Nelson [mailto:mnel...@redhat.com]
 Sent: 30. kesäkuuta 2015 22:01
 To: Tuomas Juntunen; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

 Seems reasonable.  What's the latency distribution look like in your
 fio output file?  Would be useful to know if it's universally slow or
 if some ops are taking much longer to complete than others.

 Mark

 On 06/30/2015 01:27 PM, Tuomas Juntunen wrote:
 I created a file which has the following parameters


 [random-read]
 rw=randread
 size=128m
 directory=/root/asd
 ioengine=libaio
 bs=4k
 #numjobs=8
 iodepth=64


 Br,T
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Mark Nelson
 Sent: 30. kesäkuuta 2015 20:55
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

 Hi Tuomos,

 Can you paste the command you ran to do the test?

 Thanks,
 Mark

 On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:
 Hi

 It’s not probably hitting the disks, but that really doesn’t matter.
 The point is we have very responsive VM’s while writing and that is
 what the users will see.

 The iops we get with sequential read is good, but the random read is
 way too low.

 Is using SSD’s as OSD’s the only way to get it up? or is there some
 tunable which would enhance it? I would assume Linux caches reads in
 memory and serves them from there, but atleast now we don’t see it.

 Br,

 Tuomas

 *From:*Somnath Roy [mailto:somnath@sandisk.com]
 *Sent:* 30. kesäkuuta 2015 19:24
 *To:* Tuomas Juntunen; 'ceph-users'
 *Subject:* RE: [ceph-users] Very low 4k randread performance
 ~1000iops

 Break it down, try fio-rbd to see what is the performance you getting..

 But, I am really surprised you are getting > 100k iops for write,
 did you check it is hitting the disks?

 Thanks & Regards

 Somnath

 *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
 Behalf Of *Tuomas Juntunen
 *Sent:* Tuesday, June 30, 2015 8:33 AM
 *To:* 'ceph-users'
 *Subject:* [ceph-users] Very low 4k randread performance ~1000iops

 Hi

 I have been trying to figure out why our 4k random reads in VM’s are
 so bad. I am using fio to test this.

 Write : 170k iops

 Random write : 109k iops

 Read : 

Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-07-01 Thread Mark Nelson



On 07/01/2015 01:39 PM, Tuomas Juntunen wrote:

Thanks Mark

Are there any plans for ZFS like L2ARC to CEPH or is the cache tiering what
should work like this in the future?

I have tested cache tier + EC pool, and that created too much load on our
servers, so it was not viable to be used.


We are doing a lot of work in this space right now.  Hopefully we'll see 
improvements coming in the coming releases.




I was also wondering if EnhanceIO would be a good solution for getting more
random iops. I've read some Sébastien's writings.


Possibly!  Try it and let us know. ;)



Br,
Tuomas


-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: 1. heinäkuuta 2015 20:29
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

On 07/01/2015 12:13 PM, Tuomas Juntunen wrote:

Hi

Yes, the OSD's are on spinning disks and we have 18 SSD's for journal,
one SSD for two OSD's

The OSD's are:
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164

What I've understood the journals are not used as read cache at all,
just for writing. Would SSD based cache pool be viable solution here?


Ok, so that makes more sense. The performance is still lower than expected
but maybe 3-4x rather than several orders of magnitude.  My guess is that
cache tiering in it's current form probably won't help you much unless you
have a workload that fits mostly into the cache.  The promotion penalty is
really high though so we likely will have to promote much more slowly than
we currently do.

Mark



Br, T

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: 1. heinäkuuta 2015 13:58
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops



On 06/30/2015 10:42 PM, Tuomas Juntunen wrote:

Hi

For seq reads here's the latencies:
   lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%
   lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50%
   lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19%

Random reads:
   lat (usec) : 10=0.01%
   lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55%
   lat (msec) : 100=99.31%, 250=0.08%

100msecs seems a lot to me.


It is, but what's more interesting imho is that it's so consistent.  You
don't have some ops completing fast and other ones completing slowly

holding

everything up.  It's like the OSDs are simply overloaded with concurrent

IOs

and everything is waiting.  Maybe I'm confused, are your OSDs on SSDs?

Are

there spinning disks involved?  If so, what model(s)?

You might want to use collectl -sD -oT on one of the OSD nodes during

the

test and see what the IO to the disk looks like during random reads and

the

especially with the svctime for the disks is like.

Mark



Br,T

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: 30. kesäkuuta 2015 22:01
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Seems reasonable.  What's the latency distribution look like in your
fio output file?  Would be useful to know if it's universally slow or
if some ops are taking much longer to complete than others.

Mark

On 06/30/2015 01:27 PM, Tuomas Juntunen wrote:

I created a file which has the following parameters


[random-read]
rw=randread
size=128m
directory=/root/asd
ioengine=libaio
bs=4k
#numjobs=8
iodepth=64


Br,T
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Mark Nelson
Sent: 30. kesäkuuta 2015 20:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Hi Tuomos,

Can you paste the command you ran to do the test?

Thanks,
Mark

On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:

Hi

It’s not probably hitting the disks, but that really doesn’t matter.
The point is we have very responsive VM’s while writing and that is
what the users will see.

The iops we get with sequential read is good, but the random read is
way too low.

Is using SSD’s as OSD’s the only way to get it up? or is there some
tunable which would enhance it? I would assume Linux caches reads in
memory and serves them from there, but atleast now we don’t see it.

Br,

Tuomas

*From:*Somnath Roy [mailto:somnath@sandisk.com]
*Sent:* 30. kesäkuuta 2015 19:24
*To:* Tuomas Juntunen; 'ceph-users'
*Subject:* RE: [ceph-users] Very low 4k randread performance
~1000iops

Break it down, try fio-rbd to see what is the performance you getting..

But, I am really surprised you are getting  100k iops for write,
did you check it is hitting the disks ?

Thanks  Regards

Somnath

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Tuomas Juntunen
*Sent:* Tuesday, June 30, 2015 8:33 AM
*To:* 'ceph-users'
*Subject:* [ceph-users] Very low 4k randread performance ~1000iops

Hi

I have been trying 

Re: [ceph-users] Error create subuser

2015-07-01 Thread Mikaël Guichard

Hi,

I think it's because the secret key for the swift subuser has not been generated:

radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret


Mikaël



On 01/07/2015 14:50, Jimmy Goffaux wrote:

radosgw-agent = 1.2.1 (trusty)
Ubuntu 14.04

English version :

Hello,

According to the documentation here: 
http://ceph.com/docs/master/radosgw/admin/
I followed the documentation to the letter and the result is totally 
different:


root@ih-prd-rgw01:~# radosgw-admin user create --uid=johndoe --display-name="John Doe" --email=m...@email.com

{ "user_id": "johndoe",
[...]
  "subusers": [],
  "keys": [
    { "user": "johndoe",
      "access_key": "SO4FYX3VXA8TO9D9AAM9",
      "secret_key": "tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc"}],
  "swift_keys": [],
[...]

root@ih-prd-rgw01:~# radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full

{ "user_id": "johndoe",
[...]
  "subusers": [
    { "id": "johndoe:swift",
      "permissions": "full-control"}],
  "keys": [
    { "user": "johndoe:swift",
      "access_key": "6ENTC5V4OD15A3UO9B11",
      "secret_key": ""},
    { "user": "johndoe",
      "access_key": "SO4FYX3VXA8TO9D9AAM9",
      "secret_key": "tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc"}],
  "swift_keys": [],
[...]

Would you have any idea why the Swift user is not listed under 
swift_keys?


Thanks...

French version :

Hello,

Following the documentation here: 
http://ceph.com/docs/master/radosgw/admin/
I followed it to the letter and the result is totally 
different:


root@ih-prd-rgw01:~# radosgw-admin user create --uid=johndoe 
--display-name=John Doe --email=ji...@goffaux.fr

{ user_id: johndoe,
[]
  subusers: [],
  keys: [
{ user: johndoe,
  access_key: SO4FYX3VXA8TO9D9AAM9,
  secret_key: tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc}],
  swift_keys: [],
[...]

root@ih-prd-rgw01:~# radosgw-admin subuser create --uid=johndoe 
--subuser=johndoe:swift --access=full

{ user_id: johndoe,
[...]
  subusers: [
{ id: johndoe:swift,
  permissions: full-control}],
  keys: [
{ user: johndoe:swift,
  access_key: 6ENTC5V4OD15A3UO9B11,
  secret_key: },
{ user: johndoe,
  access_key: SO4FYX3VXA8TO9D9AAM9,
  secret_key: tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc}],
  swift_keys: [],
[.]

Would you have any idea why the Swift user does not appear under 
swift_keys?


Thanks


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Freezes on VM's after upgrade from Giant to Hammer, app is not responding

2015-07-01 Thread Mateusz Skała
Hi Cephers,

On Sunday evening we upgraded Ceph from 0.87 to 0.94. After the upgrade, VMs
running on Proxmox freeze for 3-4 s roughly every 10 minutes (the application is not
responding on Windows). Before the upgrade everything was working fine. In
/proc/diskstats, field 7 (time spent reading (ms)) and field 11 (time spent
writing (ms)) show peaks from 90 ms to 2000 ms on the OSD disks.
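
Those two fields can be watched with something like this ("sdc" here is just a placeholder for one of the OSD data disks):

while true; do awk '$3 == "sdc" {print $7, $11}' /proc/diskstats; sleep 1; done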

Were any default settings changed? Every server has one SSD for journal
and 4-6 OSDs. BBUs are OK on the controllers. Our ceph.conf is below.

 

[global]

fsid={UUID}

 

mon initial members = ceph35, ceph30, ceph20, ceph15, ceph10

mon host = 10.20.8.35, 10.20.8.30, 10.20.8.20, 10.20.8.15, 10.20.8.10

 

public network = 10.20.8.0/22

cluster network = 10.20.4.0/22

 

filestore xattr use omap = true

filestore max sync interval = 30

 

osd journal size = 10240

osd mount options xfs = rw,noatime,inode64,allocsize=4M

osd pool default size = 3

osd pool default min size = 1

osd pool default pg num = 2048

osd pool default pgp num = 2048

osd disk thread ioprio class = idle

osd disk thread ioprio priority = 7

 

osd crush chooseleaf type = 1

osd recovery max active = 1

osd recovery op priority = 1

osd max backfills = 1

 

auth cluster required = cephx

auth service required = cephx

auth client required = cephx

 

rbd default format = 2

 

##ceph35 osds

[osd.0]

cluster addr = 10.20.4.35

[osd.1]

cluster addr = 10.20.4.35

[osd.2]

cluster addr = 10.20.4.35

[osd.3]

cluster addr = 10.20.4.35

[osd.4]

cluster addr = 10.20.4.35

[osd.5]

cluster addr = 10.20.4.35

 

##ceph25 osds

[osd.6]

cluster addr = 10.20.4.25

[osd.7]

cluster addr = 10.20.4.25

[osd.8]

cluster addr = 10.20.4.25

[osd.9]

cluster addr = 10.20.4.25

[osd.10]

cluster addr = 10.20.4.25

[osd.11]

cluster addr = 10.20.4.25

 

##ceph15 osds

[osd.12]

cluster addr = 10.20.4.15

[osd.13]

cluster addr = 10.20.4.15

[osd.14]

cluster addr = 10.20.4.15

[osd.15]

cluster addr = 10.20.4.15

 

##ceph30 osds

[osd.16]

cluster addr = 10.20.4.30

[osd.17]

cluster addr = 10.20.4.30

[osd.18]

cluster addr = 10.20.4.30

[osd.19]

cluster addr = 10.20.4.30

[osd.20]

cluster addr = 10.20.4.30

[osd.21]

cluster addr = 10.20.4.30

 

##ceph20 osds

[osd.22]

cluster addr = 10.20.4.20

[osd.23]

cluster addr = 10.20.4.20

[osd.24]

cluster addr = 10.20.4.20

[osd.25]

cluster addr = 10.20.4.20

[osd.26]

cluster addr = 10.20.4.20

[osd.27]

cluster addr = 10.20.4.20

 

##ceph10 osd

[osd.28]

cluster addr = 10.20.4.10

[osd.29]

cluster addr = 10.20.4.10

[osd.30]

cluster addr = 10.20.4.10

[osd.31]

cluster addr = 10.20.4.10

 

 

# monitor addresses

[mon.ceph35]

host = ceph35

mon addr = 10.20.8.35:6789

[mon.ceph30]

host = ceph30

mon addr = 10.20.8.30:6789

[mon.ceph20]

host = ceph20

mon addr = 10.20.8.20:6789

[mon.ceph15]

host = ceph15

mon addr = 10.20.8.15:6789

[mon.ceph10]

host = ceph10

mon addr = 10.20.8.10:6789

 

Thanks for help.

Regards 

Mateusz

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph references

2015-07-01 Thread Florent MONTHEL
Hi community

Do you know if there is a page listing all the official Ceph clusters deployed, with 
number of nodes, capacity, and protocol (block / file / object)?
If not, would you agree to creating such a list on the Ceph site?
Thanks

Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error create subuser

2015-07-01 Thread Mikaël Guichard
It's not really a problem; the swift johndoe user works as long as it has a record 
in swift_keys.


The S3 secret key of the johndoe user is here:
"keys": [
    { "user": "johndoe",
      "access_key": "91KC4JI5BRO39A22JY9I",
      "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"},

I tested swift and s3 access with the same configuration as yours and it 
works for me.


Mikaël

On 01/07/2015 15:09, Jimmy Goffaux wrote:

{ "user": "johndoe:swift",
  "access_key": "UFSCBO5JXROB8641XF52",
  "secret_key": ""}],


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error create subuser

2015-07-01 Thread Jimmy Goffaux

hi,

Thank you for your return but,

I just regenerated the user completely and I confirm that I still have the 
problem :(


radosgw-admin user create --uid=johndoe --display-name="John Doe" --email=m...@email.com

  "subusers": [],
  "keys": [
    { "user": "johndoe",
      "access_key": "91KC4JI5BRO39A22JY9I",
      "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"}],
  "swift_keys": []

radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret

  "subusers": [],
  "keys": [
    { "user": "johndoe",
      "access_key": "91KC4JI5BRO39A22JY9I",
      "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"}],
  "swift_keys": [
    { "user": "johndoe:swift",
      "secret_key": "04ZuTaKP8Eq8WBW9fMZJItzkeSOpc9jJkAdSe4pO"}],



radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full

  "subusers": [
    { "id": "johndoe:swift",
      "permissions": "full-control"}],
  "keys": [
    { "user": "johndoe",
      "access_key": "91KC4JI5BRO39A22JY9I",
      "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"},
    { "user": "johndoe:swift",
      "access_key": "UFSCBO5JXROB8641XF52",
      "secret_key": ""}],
  "swift_keys": [
    { "user": "johndoe:swift",
      "secret_key": "04ZuTaKP8Eq8WBW9fMZJItzkeSOpc9jJkAdSe4pO"}],


The subuser has the right permission, but what I do not understand is 
why under keys I have:

{ "user": "johndoe:swift",
  "access_key": "UFSCBO5JXROB8641XF52",
  "secret_key": ""}],

Thanks

On Wed, 01 Jul 2015 15:03:33 +0200, Mikaël Guichard wrote:

Hi,

I think it's because secret key for swift subuser is not generated :

radosgw-admin key create --subuser=johndoe:swift --key-type=swift
--gen-secret


Mikaël



Le 01/07/2015 14:50, Jimmy Goffaux a écrit :

radosgw-agent= 1.2.1trust
Ubuntu 14.04

English version :

Hello,

According to the documentation here: 
http://ceph.com/docs/master/radosgw/admin/
I followed to the letter the documentation and the result is totally 
different:


root@ih-prd-rgw01:~# radosgw-admin user create --uid=johndoe 
--display-name=John Doe --email=m...@email.com

{ user_id: johndoe,
[]
  subusers: [],
  keys: [
{ user: johndoe,
  access_key: SO4FYX3VXA8TO9D9AAM9,
  secret_key: 
tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc}],

  swift_keys: [],
[...]

root@ih-prd-rgw01:~# radosgw-admin subuser create --uid=johndoe 
--subuser=johndoe:swift --access=full

{ user_id: johndoe,
[...]
  subusers: [
{ id: johndoe:swift,
  permissions: full-control}],
  keys: [
{ user: johndoe:swift,
  access_key: 6ENTC5V4OD15A3UO9B11,
  secret_key: },
{ user: johndoe,
  access_key: SO4FYX3VXA8TO9D9AAM9,
  secret_key: 
tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc}],

  swift_keys: [],
[.]

Would you have any idea why the SWIFT user is not in the tags 
swift_keys?


Thanks...

French version :

Bonjour,

En suivant la documentation ici : 
http://ceph.com/docs/master/radosgw/admin/
J'ai suivi à la lettre la documentation  et le résultat est 
totalement différents :


root@ih-prd-rgw01:~# radosgw-admin user create --uid=johndoe 
--display-name=John Doe --email=ji...@goffaux.fr

{ user_id: johndoe,
[]
  subusers: [],
  keys: [
{ user: johndoe,
  access_key: SO4FYX3VXA8TO9D9AAM9,
  secret_key: 
tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc}],

  swift_keys: [],
[...]

root@ih-prd-rgw01:~# radosgw-admin subuser create --uid=johndoe 
--subuser=johndoe:swift --access=full

{ user_id: johndoe,
[...]
  subusers: [
{ id: johndoe:swift,
  permissions: full-control}],
  keys: [
{ user: johndoe:swift,
  access_key: 6ENTC5V4OD15A3UO9B11,
  secret_key: },
{ user: johndoe,
  access_key: SO4FYX3VXA8TO9D9AAM9,
  secret_key: 
tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc}],

  swift_keys: [],
[.]

Auriez-vous une idée pourquoi l'utilisateur SWIFT ne se trouve pas 
dans les balises swift_keys ?


Merci


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--

Jimmy Goffaux
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados gateway / RBD access restrictions

2015-07-01 Thread Jacek Jarosiewicz

ok, I think I found the answer to the second question:

http://wiki.ceph.com/Planning/Blueprints/Giant/Add_QoS_capacity_to_librbd

..librbd doesn't support any QoS for now..

Can anyone shed some light on the namespaces and limiting S3 users to 
one bucket?


J

On 07/01/2015 10:31 AM, Jacek Jarosiewicz wrote:

Hi,

I've been playing around with the rados gateway and RBD and have some
questions about user access restrictions. I'd like to be able to set up
a cluster that would be shared among different clients without any
conflicts...

Is there a way to limit S3/Swift clients to be able to write data only
to one bucket? Now S3 users can create their own buckets and as many as
they want - it would be good to have some kind of control over what user
can and can't do. I found this thread about namespaces:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/033451.html

..but it's old and I was wondering if maybe the namespace feature is
better documented now somewhere?

Another problem: is there a way to limit RBD clients to a certain
bandwidth and/or iops they can use? So that one client can't disrupt
another client's vm's for example?

Cheers,
J




--
Jacek Jarosiewicz
Administrator Systemów Informatycznych


SUPERMEDIA Sp. z o.o. z siedzibą w Warszawie
ul. Senatorska 13/15, 00-075 Warszawa
Sąd Rejonowy dla m.st.Warszawy, XII Wydział Gospodarczy Krajowego 
Rejestru Sądowego,

nr KRS 029537; kapitał zakładowy 42.756.000 zł
NIP: 957-05-49-503
Adres korespondencyjny: ul. Jubilerska 10, 04-190 Warszawa


SUPERMEDIA -   http://www.supermedia.pl
dostep do internetu - hosting - kolokacja - lacza - telefonia
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error create subuser

2015-07-01 Thread Jimmy Goffaux
Yes, it also works... it's more that I did not expect to have a 
johndoe:swift element under keys.


Thank you for providing answers.

On Wed, 01 Jul 2015 15:28:15 +0200, Mikaël Guichard wrote:

It's not really a problem, swift johndoe user works if he has a
record in swift_keys.

The s3 secret key of johndoe user is here :
keys: [
{ user: johndoe,
  access_key: 91KC4JI5BRO39A22JY9I,
  secret_key: Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF},

I tested swift and s3 access with the same configuration as yours and
it works for me.

Mikaël

Le 01/07/2015 15:09, Jimmy Goffaux a écrit :

{ user: johndoe:swift,
  access_key: UFSCBO5JXROB8641XF52,
  secret_key: }],


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--

Jimmy Goffaux
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Perfomance issue.

2015-07-01 Thread Emmanuel Florac
On Tue, 16 Jun 2015 10:04:26 +0200,
Marcus Forness pixel...@gmail.com wrote:

 hi! anyone able to provide some tips on a performance issue on a newly
 installed all-flash ceph cluster? When we do write tests we get 900MB/s
 write, but read tests are only 200MB/s. All servers are on 10Gbit
 connections.
 
 [global]
 fsid = 453d2db9-c764-4921-8f3c-ee0f75412e19
 mon_initial_members = ceph02, ceph03, ceph04
 mon_host = 10.129.23.202,10.129.23.203,10.129.23.204
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true
 public_network = 10.129.0.0/16
 
 
 this is the ceph conf file

Did you test the local filesystem performance of your servers?

-- 

Emmanuel Florac |   Direction technique
|   Intellique
|   eflo...@intellique.com
|   +33 1 78 94 84 02

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph erasure code benchmark failing

2015-07-01 Thread Nitin Saxena
Hi,

I am new to the ceph project. I am trying to benchmark erasure code on Intel
hardware and I am getting the following error.

[root@nitin ceph]#
CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark
PLUGIN_DIRECTORY=src/.libs qa/workunits/erasure-code/bench.sh
seconds KB  plugin  k   m   work.   iter.   sizeeras.
command.
serie encode_vandermonde_isa
load dlopen(src/.libs/libec_isa.so): src/.libs/libec_isa.so: cannot open
shared object file: No such file or directory

I have checked out master branch and compiled as ceph with following steps

./autogen.sh ; ./configure --with-debug --without-tcmalloc
--without-fuse;make


Am I missing something here?

Thanks in advance
Nitin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph erasure code benchmark failing

2015-07-01 Thread David Casier AEVOO

Hi Nitin,
Have you installed the YASM compiler?

David

On 07/01/2015 01:46 PM, Nitin Saxena wrote:

Hi,

I am new to ceph project. I am trying to benchmark erasure code on 
Intel and I am getting following error.


[root@nitin ceph]# 
CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark 
PLUGIN_DIRECTORY=src/.libs qa/workunits/erasure-code/bench.sh
seconds KB  plugin  k   m   work.   iter. sizeeras.   
command.

serie encode_vandermonde_isa
*load dlopen(src/.libs/libec_isa.so): src/.libs/libec_isa.so: cannot 
open shared object file: No such file or directory*

I have checked out master branch and compiled as ceph with following steps

./autogen.sh ; ./configure --with-debug --without-tcmalloc 
--without-fuse;make



Am I missing something here?

Thanks in advance
Nitin


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--


Regards,

David Casier

Direct line: 06 65 19 66 84
Email: david.cas...@aevoo.fr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-07-01 Thread Mark Nelson



On 06/30/2015 10:42 PM, Tuomas Juntunen wrote:

Hi

For seq reads here's the latencies:
 lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%
 lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50%
 lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19%

Random reads:
 lat (usec) : 10=0.01%
 lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55%
 lat (msec) : 100=99.31%, 250=0.08%

100msecs seems a lot to me.


It is, but what's more interesting imho is that it's so consistent.  You 
don't have some ops completing fast and other ones completing slowly 
holding everything up.  It's like the OSDs are simply overloaded with 
concurrent IOs and everything is waiting.  Maybe I'm confused, are your 
OSDs on SSDs?  Are there spinning disks involved?  If so, what model(s)?


You might want to use collectl -sD -oT on one of the OSD nodes during 
the test and see what the IO to the disk looks like during random reads 
and the especially with the svctime for the disks is like.


Mark



Br,T

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: 30. kesäkuuta 2015 22:01
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Seems reasonable.  What's the latency distribution look like in your fio
output file?  Would be useful to know if it's universally slow or if some
ops are taking much longer to complete than others.

Mark

On 06/30/2015 01:27 PM, Tuomas Juntunen wrote:

I created a file which has the following parameters


[random-read]
rw=randread
size=128m
directory=/root/asd
ioengine=libaio
bs=4k
#numjobs=8
iodepth=64


Br,T
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Mark Nelson
Sent: 30. kesäkuuta 2015 20:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Hi Tuomos,

Can you paste the command you ran to do the test?

Thanks,
Mark

On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:

Hi

It’s not probably hitting the disks, but that really doesn’t matter.
The point is we have very responsive VM’s while writing and that is
what the users will see.

The iops we get with sequential read is good, but the random read is
way too low.

Is using SSD’s as OSD’s the only way to get it up? or is there some
tunable which would enhance it? I would assume Linux caches reads in
memory and serves them from there, but atleast now we don’t see it.

Br,

Tuomas

*From:*Somnath Roy [mailto:somnath@sandisk.com]
*Sent:* 30. kesäkuuta 2015 19:24
*To:* Tuomas Juntunen; 'ceph-users'
*Subject:* RE: [ceph-users] Very low 4k randread performance
~1000iops

Break it down, try fio-rbd to see what is the performance you getting..

But, I am really surprised you are getting  100k iops for write, did
you check it is hitting the disks ?

Thanks  Regards

Somnath

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Tuomas Juntunen
*Sent:* Tuesday, June 30, 2015 8:33 AM
*To:* 'ceph-users'
*Subject:* [ceph-users] Very low 4k randread performance ~1000iops

Hi

I have been trying to figure out why our 4k random reads in VM’s are
so bad. I am using fio to test this.

Write : 170k iops

Random write : 109k iops

Read : 64k iops

Random read : 1k iops

Our setup is:

3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has
64gb mem  2x6core cpu’s

4 monitors running on other servers

40gbit infiniband with IPoIB

Openstack : Qemu-kvm for virtuals

Any help would be appreciated

Thank you in advance.

Br,

Tuomas

-
-
--


PLEASE NOTE: The information contained in this electronic mail
message is intended only for the use of the designated recipient(s) named

above.

If the reader of this message is not the intended recipient, you are
hereby notified that you have received this message in error and that
any review, dissemination, distribution, or copying of this message
is strictly prohibited. If you have received this communication in
error, please notify the sender by telephone or e-mail (as shown
above) immediately and destroy any and all copies of this message in
your possession (whether hard copies or electronically stored copies).



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados gateway / RBD access restrictions

2015-07-01 Thread Dan van der Ster
On Wed, Jul 1, 2015 at 3:10 PM, Jacek Jarosiewicz
jjarosiew...@supermedia.pl wrote:
 ok, I think I found the answer to the second question:

 http://wiki.ceph.com/Planning/Blueprints/Giant/Add_QoS_capacity_to_librbd

 ..librbd doesn't support any QoS for now..

But libvirt/qemu can do QoS: see iotune in https://libvirt.org/formatdomain.html
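
A rough sketch of what that looks like in a domain's disk definition (the numbers are
just example caps, not recommendations):

  <disk type='network' device='disk'>
    ...
    <iotune>
      <total_bytes_sec>52428800</total_bytes_sec>
      <total_iops_sec>500</total_iops_sec>
    </iotune>
    ...
  </disk>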

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] xattrs vs omap

2015-07-01 Thread Adam Tygart
Hello all,

I've got a coworker who put filestore_xattr_use_omap = true in the
ceph.conf when we first started building the cluster. Now he can't
remember why. He thinks it may be a holdover from our first Ceph
cluster (running dumpling on ext4, iirc).

In the newly built cluster, we are using XFS with 2048 byte inodes,
running Ceph 0.94.2. It currently has production data in it.

From my reading of other threads, it looks like this is probably not
something you want set to true (at least on XFS), due to performance
implications. Is this something you can change on a running cluster?
Is it worth the hassle?

Thanks,
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph erasure code benchmark failing

2015-07-01 Thread Loic Dachary
Hi,

Like David said, the most probable cause is that no recent yasm is 
installed. You can run ./install-deps.sh to ensure the necessary dependencies are 
installed. 

Cheers

On 01/07/2015 13:46, Nitin Saxena wrote:
 Hi,
 
 I am new to ceph project. I am trying to benchmark erasure code on Intel and 
 I am getting following error.
 
 [root@nitin ceph]# 
 CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark 
 PLUGIN_DIRECTORY=src/.libs qa/workunits/erasure-code/bench.sh
 seconds KB  plugin  k   m   work.   iter.   sizeeras.   
 command.
 serie encode_vandermonde_isa
 *load dlopen(src/.libs/libec_isa.so): src/.libs/libec_isa.so: cannot open 
 shared object file: No such file or directory*
 *
 *
 I have checked out master branch and compiled as ceph with following steps
 
 ./autogen.sh ; ./configure --with-debug --without-tcmalloc --without-fuse;make
 
 
 Am I missing something here?
 
 Thanks in advance
 Nitin
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node reboot -- OSDs not logging off from cluster

2015-07-01 Thread Gregory Farnum
On Tue, Jun 30, 2015 at 10:36 AM, Daniel Schneller
daniel.schnel...@centerdevice.com wrote:
 Hi!

 We are seeing a strange - and problematic - behavior in our 0.94.1
 cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each.

 When rebooting one of the nodes (e. g. for a kernel upgrade) the OSDs
 do not seem to shut down correctly. Clients hang and ceph osd tree show
 the OSDs of that node still up. Repeated runs of ceph osd tree show
 them going down after a while. For instance, here OSD.7 is still up,
 even though the machine is in the middle of the reboot cycle.

 [C|root@control01]  ~ ➜  ceph osd tree
 # idweight  type name   up/down reweight
 -1  36.2root default
 -2  7.24host node01
 0   1.81osd.0   up  1
 5   1.81osd.5   up  1
 10  1.81osd.10  up  1
 15  1.81osd.15  up  1
 -3  7.24host node02
 1   1.81osd.1   up  1
 6   1.81osd.6   up  1
 11  1.81osd.11  up  1
 16  1.81osd.16  up  1
 -4  7.24host node03
 2   1.81osd.2   down1
 7   1.81osd.7   up  1
 12  1.81osd.12  down1
 17  1.81osd.17  down1
 -5  7.24host node04
 3   1.81osd.3   up  1
 8   1.81osd.8   up  1
 13  1.81osd.13  up  1
 18  1.81osd.18  up  1
 -6  7.24host node05
 4   1.81osd.4   up  1
 9   1.81osd.9   up  1
 14  1.81osd.14  up  1
 19  1.81osd.19  up  1

 So it seems, the services are either not shut down correctly when the
 reboot begins, or they do not get enough time to actually let the
 cluster know they are going away.

 If I stop the OSDs on that node manually before the reboot, everything
 works as expected and clients don't notice any interruptions.

 [C|root@node03]  ~ ➜  service ceph-osd stop id=2
 ceph-osd stop/waiting
 [C|root@node03]  ~ ➜  service ceph-osd stop id=7
 ceph-osd stop/waiting
 [C|root@node03]  ~ ➜  service ceph-osd stop id=12
 ceph-osd stop/waiting
 [C|root@node03]  ~ ➜  service ceph-osd stop id=17
 ceph-osd stop/waiting
 [C|root@node03]  ~ ➜  reboot

 The upstart file was not changed from the packaged version.
 Interestingly, the same Ceph version on a different cluster does _not_
 show this behaviour.

 Any ideas as to what is causing this or how to diagnose this?

I'm not sure why it would be happening, but:
* The OSDs send out shutdown messages to the monitor indicating
they're going away whenever they get shut down politely. There's a
short timeout to make sure they don't hang on you.
* The only way the OSD doesn't get marked down during reboot is if the
monitor doesn't get this message.
* If the monitor isn't getting the message, the OSD either isn't
sending the message or it's getting blocked.

My guess is that for some reason the OSDs are getting the shutdown
signal after the networking goes away.
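
One way to check would be to grep the cluster log on one of the mons for the
polite-shutdown message, e.g. (the log path may differ on your setup):

grep "marked itself down" /var/log/ceph/ceph.log

If osd.7 never shows up there around the reboot, the monitor never got the message.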
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xattrs vs omap

2015-07-01 Thread Somnath Roy
It doesn't matter; I think filestore_xattr_use_omap is a 'noop' and not used 
in Hammer.
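
If you want to confirm what a running OSD actually has, the admin socket should tell
you, e.g. (run on the node hosting osd.0, or any other OSD id):

ceph daemon osd.0 config get filestore_xattr_use_omap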

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adam 
Tygart
Sent: Wednesday, July 01, 2015 8:20 AM
To: Ceph Users
Subject: [ceph-users] xattrs vs omap

Hello all,

I've got a coworker who put filestore_xattr_use_omap = true in the ceph.conf 
when we first started building the cluster. Now he can't remember why. He 
thinks it may be a holdover from our first Ceph cluster (running dumpling on 
ext4, iirc).

In the newly built cluster, we are using XFS with 2048 byte inodes, running 
Ceph 0.94.2. It currently has production data in it.

From my reading of other threads, it looks like this is probably not something 
you want set to true (at least on XFS), due to performance implications. Is 
this something you can change on a running cluster?
Is it worth the hassle?

Thanks,
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-07-01 Thread Tuomas Juntunen
Hi

I'll check the possibility on testing EnhanceIO. I'll report back on this.

Thanks

Br,T

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: 1. heinäkuuta 2015 21:51
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops



On 07/01/2015 01:39 PM, Tuomas Juntunen wrote:
 Thanks Mark

 Are there any plans for ZFS like L2ARC to CEPH or is the cache tiering 
 what should work like this in the future?

 I have tested cache tier + EC pool, and that created too much load on 
 our servers, so it was not viable to be used.

We are doing a lot of work in this space right now.  Hopefully we'll see
improvements coming in the coming releases.


 I was also wondering if EnhanceIO would be a good solution for getting 
 more random iops. I've read some Sébastien's writings.

Possibly!  Try it and let us know. ;)


 Br,
 Tuomas


 -Original Message-
 From: Mark Nelson [mailto:mnel...@redhat.com]
 Sent: 1. heinäkuuta 2015 20:29
 To: Tuomas Juntunen; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

 On 07/01/2015 12:13 PM, Tuomas Juntunen wrote:
 Hi

 Yes, the OSD's are on spinning disks and we have 18 SSD's for 
 journal, one SSD for two OSD's

 The OSD's are:
 Model Family: Seagate Barracuda 7200.14 (AF)
 Device Model: ST2000DM001-1CH164

 What I've understood the journals are not used as read cache at all, 
 just for writing. Would SSD based cache pool be viable solution here?

 Ok, so that makes more sense. The performance is still lower than 
 expected but maybe 3-4x rather than several orders of magnitude.  My 
 guess is that cache tiering in it's current form probably won't help 
 you much unless you have a workload that fits mostly into the cache.  
 The promotion penalty is really high though so we likely will have to 
 promote much more slowly than we currently do.

 Mark


 Br, T

 -Original Message-
 From: Mark Nelson [mailto:mnel...@redhat.com]
 Sent: 1. heinäkuuta 2015 13:58
 To: Tuomas Juntunen; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops



 On 06/30/2015 10:42 PM, Tuomas Juntunen wrote:
 Hi

 For seq reads here's the latencies:
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%
lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50%
lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19%

 Random reads:
lat (usec) : 10=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55%
lat (msec) : 100=99.31%, 250=0.08%

 100msecs seems a lot to me.

 It is, but what's more interesting imho is that it's so consistent.  
 You don't have some ops completing fast and other ones completing 
 slowly
 holding
 everything up.  It's like the OSDs are simply overloaded with 
 concurrent
 IOs
 and everything is waiting.  Maybe I'm confused, are your OSDs on SSDs?
 Are
 there spinning disks involved?  If so, what model(s)?

 You might want to use collectl -sD -oT on one of the OSD nodes 
 during
 the
 test and see what the IO to the disk looks like during random reads 
 and
 the
 especially with the svctime for the disks is like.

 Mark


 Br,T

 -Original Message-
 From: Mark Nelson [mailto:mnel...@redhat.com]
 Sent: 30. kesäkuuta 2015 22:01
 To: Tuomas Juntunen; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

 Seems reasonable.  What's the latency distribution look like in your 
 fio output file?  Would be useful to know if it's universally slow 
 or if some ops are taking much longer to complete than others.

 Mark

 On 06/30/2015 01:27 PM, Tuomas Juntunen wrote:
 I created a file which has the following parameters


 [random-read]
 rw=randread
 size=128m
 directory=/root/asd
 ioengine=libaio
 bs=4k
 #numjobs=8
 iodepth=64


 Br,T
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
 Behalf Of Mark Nelson
 Sent: 30. kesäkuuta 2015 20:55
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance 
 ~1000iops

 Hi Tuomos,

 Can you paste the command you ran to do the test?

 Thanks,
 Mark

 On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:
 Hi

 It’s not probably hitting the disks, but that really doesn’t matter.
 The point is we have very responsive VM’s while writing and that 
 is what the users will see.

 The iops we get with sequential read is good, but the random read 
 is way too low.

 Is using SSD’s as OSD’s the only way to get it up? or is there 
 some tunable which would enhance it? I would assume Linux caches 
 reads in memory and serves them from there, but atleast now we don’t
see it.

 Br,

 Tuomas

 *From:*Somnath Roy [mailto:somnath@sandisk.com]
 *Sent:* 30. kesäkuuta 2015 19:24
 *To:* Tuomas Juntunen; 'ceph-users'
 *Subject:* RE: [ceph-users] Very low 4k randread performance 
 ~1000iops

 Break it down, 

[ceph-users] any recommendation of using EnhanceIO?

2015-07-01 Thread German Anders
Hi cephers,

   Is anyone out there that implement enhanceIO in a production
environment? any recommendation? any perf output to share with the diff
between using it and not?

Thanks in advance,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Redhat Storage Ceph Storage 1.3 released

2015-07-01 Thread Ken Dreyer
On 07/01/2015 03:02 PM, Vickey Singh wrote:
 - What's the exact version number of OpenSource Ceph is provided with
 this Product 

It is Hammer, specifically 0.94.1 with several critical bugfixes on top
as the product went through QE. All of the bugfixes have been proposed
or merged to Hammer upstream, IIRC, so the product has many of the
serious bug fixes that were in 0.94.2 or the upcoming 0.94.3.

 - RHCS 1.3 Features that are mentioned in the blog , will all of them
 present in open source Ceph.

Yep! That blog post describes many of the changes from Firefly to Hammer.

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One of our nodes has logs saying: wrongly marked me down

2015-07-01 Thread Somnath Roy
This can happen if your OSDs are flapping. Hope your network is stable.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas 
Juntunen
Sent: Wednesday, July 01, 2015 2:24 PM
To: 'ceph-users'
Subject: [ceph-users] One of our nodes has logs saying: wrongly marked me down

Hi

One our nodes has OSD logs that say wrongly marked me down for every OSD at 
some point. What could be the reason for this. Anyone have any similar 
experiences?

Other nodes work totally fine and they are all identical.

Br,T



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-07-01 Thread Dominik Zalewski
Hi,


I asked the same question a week or so ago (just search the mailing list
archives for EnhanceIO :) and got some interesting answers.


Looks like the project has been pretty much dead since it was bought out by HGST.
Even their website has some broken links in regards to EnhanceIO.


I'm keen to try flashcache or bcache (it's been in the mainline kernel for
some time); a rough setup sketch is below.
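
From what I've read, the basic bcache setup is roughly this (device names are
placeholders, /dev/sdc being the SSD and /dev/sdb the spinner):

# format cache and backing device together so they attach automatically
make-bcache -C /dev/sdc -B /dev/sdb
# then put the OSD filesystem on the resulting bcache device
mkfs.xfs /dev/bcache0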


Dominik

On Wed, Jul 1, 2015 at 9:13 PM, German Anders gand...@despegar.com wrote:

 Hi cephers,

Is anyone out there that implement enhanceIO in a production
 environment? any recommendation? any perf output to share with the diff
 between using it and not?

 Thanks in advance,

 *German*

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node reboot -- OSDs not logging off from cluster

2015-07-01 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 01 July 2015 16:56
 To: Daniel Schneller
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Node reboot -- OSDs not logging off from cluster
 
 On Tue, Jun 30, 2015 at 10:36 AM, Daniel Schneller
 daniel.schnel...@centerdevice.com wrote:
  Hi!
 
  We are seeing a strange - and problematic - behavior in our 0.94.1
  cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each.
 
  When rebooting one of the nodes (e. g. for a kernel upgrade) the OSDs
  do not seem to shut down correctly. Clients hang and ceph osd tree
  show the OSDs of that node still up. Repeated runs of ceph osd tree
  show them going down after a while. For instance, here OSD.7 is still
  up, even though the machine is in the middle of the reboot cycle.
 
  [C|root@control01]  ~ ➜  ceph osd tree
  # idweight  type name   up/down reweight
  -1  36.2root default
  -2  7.24host node01
  0   1.81osd.0   up  1
  5   1.81osd.5   up  1
  10  1.81osd.10  up  1
  15  1.81osd.15  up  1
  -3  7.24host node02
  1   1.81osd.1   up  1
  6   1.81osd.6   up  1
  11  1.81osd.11  up  1
  16  1.81osd.16  up  1
  -4  7.24host node03
  2   1.81osd.2   down1
  7   1.81osd.7   up  1
  12  1.81osd.12  down1
  17  1.81osd.17  down1
  -5  7.24host node04
  3   1.81osd.3   up  1
  8   1.81osd.8   up  1
  13  1.81osd.13  up  1
  18  1.81osd.18  up  1
  -6  7.24host node05
  4   1.81osd.4   up  1
  9   1.81osd.9   up  1
  14  1.81osd.14  up  1
  19  1.81osd.19  up  1
 
  So it seems, the services are either not shut down correctly when the
  reboot begins, or they do not get enough time to actually let the
  cluster know they are going away.
 
  If I stop the OSDs on that node manually before the reboot, everything
  works as expected and clients don't notice any interruptions.
 
  [C|root@node03]  ~ ➜  service ceph-osd stop id=2 ceph-osd stop/waiting
  [C|root@node03]  ~ ➜  service ceph-osd stop id=7 ceph-osd stop/waiting
  [C|root@node03]  ~ ➜  service ceph-osd stop id=12 ceph-osd
  stop/waiting [C|root@node03]  ~ ➜  service ceph-osd stop id=17
  ceph-osd stop/waiting [C|root@node03]  ~ ➜  reboot
 
  The upstart file was not changed from the packaged version.
  Interestingly, the same Ceph version on a different cluster does _not_
  show this behaviour.
 
  Any ideas as to what is causing this or how to diagnose this?

Do you have the OSD's running on the same boxes as the monitors? 

 
 I'm not sure why it would be happening, but:
 * The OSDs send out shutdown messages to the monitor indicating they're
 going away whenever they get shut down politely. There's a short timeout to
 make sure they don't hang on you.
 * The only way the OSD doesn't get marked down during reboot is if the
 monitor doesn't get this message.
 * If the monitor isn't getting the message, the OSD either isn't sending the
 message or it's getting blocked.
 
 My guess is that for some reason the OSDs are getting the shutdown signal
 after the networking goes away.
 -Greg
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Redhat Storage Ceph Storage 1.3 released

2015-07-01 Thread Vickey Singh
Hello Ceph lovers

You may have noticed that RedHat recently released RedHat Ceph
Storage 1.3:

http://redhatstorage.redhat.com/2015/06/25/announcing-red-hat-ceph-storage-1-3/

My questions are:

- What's the exact version number of open-source Ceph provided with this
product?
- The RHCS 1.3 features mentioned in the blog: will all of them be
present in open-source Ceph?



Regards
Vickey
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] One of our nodes has logs saying: wrongly marked me down

2015-07-01 Thread Tuomas Juntunen
Hi

 

One of our nodes has OSD logs that say "wrongly marked me down" for every OSD
at some point. What could be the reason for this? Has anyone had similar
experiences?

 

Other nodes work totally fine and they are all identical.

 

Br,T

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Redhat Storage Ceph Storage 1.3 released

2015-07-01 Thread Loic Dachary
Hi,

The details of the differences between the Hammer point releases and RedHat 
Ceph Storage 1.3 can be listed as described at

http://www.spinics.net/lists/ceph-devel/msg24489.html ("reconciliation between 
hammer and v0.94.1.2").

The same analysis should be done for 
https://github.com/ceph/ceph/releases/tag/v0.94.1.3 which presumably matches 
RedHat Ceph Storage 1.3.

Cheers

On 01/07/2015 23:02, Vickey Singh wrote:
 Hello Ceph lovers
 
 You would have noticed that recently RedHat has released RedHat Ceph Storage 
 1.3
 
 http://redhatstorage.redhat.com/2015/06/25/announcing-red-hat-ceph-storage-1-3/
 
 My question is 
 
 - What's the exact version number of OpenSource Ceph is provided with this 
 Product 
 - RHCS 1.3 Features that are mentioned in the blog , will all of them present 
 in open source Ceph.
 
 
 
 Regards
 Vickey
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Journal Disk Size

2015-07-01 Thread Nate Curry
I would like to get some clarification on the size of the journal disks
that I should get for my new Ceph cluster I am planning.  I read about the
journal settings on
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings
but that didn't really clarify it for me, or else I just didn't get it.  I
found in the Learning Ceph Packt book it states that you should have one
disk for journalling for every 4 OSDs.  Using that as a reference I was
planning on getting multiple systems with 8 x 6TB inline SAS drives for
OSDs with two SSDs for journalling per host as well as 2 hot spares for the
6TB drives and 2 drives for the OS.  I was thinking of 400GB SSD drives but
am wondering if that is too much.  Any informed opinions would be
appreciated.

Thanks,

*Nate Curry*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mon performance impact on OSDs?

2015-07-01 Thread Quentin Hartman
I've been wrestling with IO performance in my cluster and one area I have
not yet explored thoroughly is whether or not performance constraints on
mon hosts would be likely to have any impact on OSDs. My mons are quite
small, and one in particular has rather high IO waits (frequently 30% or
more) due to the other work it performs, notably hosting postgres for
Openstack which is quite chatty for some reason. Is this likely to
trickle-down into the OSDs performance? Everything I've seen online
indicates the performance between MONs and OSDs  should be decoupled, but
I'd like to hear some real world experiences.

Thanks!

QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-07-01 Thread Mark Nelson

On 07/01/2015 12:13 PM, Tuomas Juntunen wrote:

Hi

Yes, the OSD's are on spinning disks and we have 18 SSD's for journal, one
SSD for two OSD's

The OSD's are:
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164

From what I've understood, the journals are not used as a read cache at all,
just for writing. Would an SSD-based cache pool be a viable solution here?


Ok, so that makes more sense. The performance is still lower than 
expected but maybe 3-4x rather than several orders of magnitude.  My 
guess is that cache tiering in its current form probably won't help you 
much unless you have a workload that fits mostly into the cache.  The 
promotion penalty is really high though so we likely will have to 
promote much more slowly than we currently do.
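
For reference, if you still want to experiment with it, a cache tier in front 
of an existing pool is set up roughly like this (sketch only; it assumes the 
data pool is called rbd and that you have already created an SSD-backed pool 
called ssd-cache with a suitable CRUSH rule):

ceph osd tier add rbd ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay rbd ssd-cache
ceph osd pool set ssd-cache hit_set_type bloom
ceph osd pool set ssd-cache target_max_bytes 1099511627776   # e.g. cap the tier at 1 TB

Whether it helps will depend entirely on how much of the hot working set fits 
into ssd-cache.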


Mark



Br, T

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: 1. heinäkuuta 2015 13:58
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops



On 06/30/2015 10:42 PM, Tuomas Juntunen wrote:

Hi

For seq reads here's the latencies:
  lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%
  lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50%
  lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19%

Random reads:
  lat (usec) : 10=0.01%
  lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55%
  lat (msec) : 100=99.31%, 250=0.08%

100msecs seems a lot to me.


It is, but what's more interesting imho is that it's so consistent.  You
don't have some ops completing fast and other ones completing slowly holding
everything up.  It's like the OSDs are simply overloaded with concurrent IOs
and everything is waiting.  Maybe I'm confused, are your OSDs on SSDs?  Are
there spinning disks involved?  If so, what model(s)?

You might want to use collectl -sD -oT on one of the OSD nodes during the
test and see what the IO to the disk looks like during random reads, and
especially what the svctime for the disks looks like.

Mark



Br,T

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: 30. kesäkuuta 2015 22:01
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Seems reasonable.  What's the latency distribution look like in your
fio output file?  Would be useful to know if it's universally slow or
if some ops are taking much longer to complete than others.

Mark

On 06/30/2015 01:27 PM, Tuomas Juntunen wrote:

I created a file which has the following parameters


[random-read]
rw=randread
size=128m
directory=/root/asd
ioengine=libaio
bs=4k
#numjobs=8
iodepth=64


Br,T
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Mark Nelson
Sent: 30. kesäkuuta 2015 20:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Hi Tuomos,

Can you paste the command you ran to do the test?

Thanks,
Mark

On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:

Hi

It’s not probably hitting the disks, but that really doesn’t matter.
The point is we have very responsive VM’s while writing and that is
what the users will see.

The iops we get with sequential read is good, but the random read is
way too low.

Is using SSD’s as OSD’s the only way to get it up? or is there some
tunable which would enhance it? I would assume Linux caches reads in
memory and serves them from there, but atleast now we don’t see it.

Br,

Tuomas

*From:*Somnath Roy [mailto:somnath@sandisk.com]
*Sent:* 30. kesäkuuta 2015 19:24
*To:* Tuomas Juntunen; 'ceph-users'
*Subject:* RE: [ceph-users] Very low 4k randread performance
~1000iops

Break it down, try fio-rbd to see what is the performance you getting..

But, I am really surprised you are getting  100k iops for write,
did you check it is hitting the disks ?

Thanks  Regards

Somnath

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Tuomas Juntunen
*Sent:* Tuesday, June 30, 2015 8:33 AM
*To:* 'ceph-users'
*Subject:* [ceph-users] Very low 4k randread performance ~1000iops

Hi

I have been trying to figure out why our 4k random reads in VM’s are
so bad. I am using fio to test this.

Write : 170k iops

Random write : 109k iops

Read : 64k iops

Random read : 1k iops

Our setup is:

3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has
64gb mem  2x6core cpu’s

4 monitors running on other servers

40gbit infiniband with IPoIB

Openstack : Qemu-kvm for virtuals

Any help would be appreciated

Thank you in advance.

Br,

Tuomas


-
-
--


PLEASE NOTE: The information contained in this electronic mail
message is intended only for the use of the designated recipient(s)
named

above.

If the reader of this message is not the intended recipient, you are
hereby notified that you have received this message in error and
that any review, 

Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-07-01 Thread Tuomas Juntunen
Hi

Yes, the OSD's are on spinning disks and we have 18 SSD's for journal, one
SSD for two OSD's

The OSD's are:
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164

From what I've understood, the journals are not used as a read cache at all,
just for writing. Would an SSD-based cache pool be a viable solution here?

Br, T

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: 1. heinäkuuta 2015 13:58
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops



On 06/30/2015 10:42 PM, Tuomas Juntunen wrote:
 Hi

 For seq reads here's the latencies:
  lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%
  lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50%
  lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19%

 Random reads:
  lat (usec) : 10=0.01%
  lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55%
  lat (msec) : 100=99.31%, 250=0.08%

 100msecs seems a lot to me.

It is, but what's more interesting imho is that it's so consistent.  You
don't have some ops completing fast and other ones completing slowly holding
everything up.  It's like the OSDs are simply overloaded with concurrent IOs
and everything is waiting.  Maybe I'm confused, are your OSDs on SSDs?  Are
there spinning disks involved?  If so, what model(s)?

You might want to use collectl -sD -oT on one of the OSD nodes during the
test and see what the IO to the disk looks like during random reads, and
especially what the svctime for the disks looks like.

Mark


 Br,T

 -Original Message-
 From: Mark Nelson [mailto:mnel...@redhat.com]
 Sent: 30. kesäkuuta 2015 22:01
 To: Tuomas Juntunen; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

 Seems reasonable.  What's the latency distribution look like in your 
 fio output file?  Would be useful to know if it's universally slow or 
 if some ops are taking much longer to complete than others.

 Mark

 On 06/30/2015 01:27 PM, Tuomas Juntunen wrote:
 I created a file which has the following parameters


 [random-read]
 rw=randread
 size=128m
 directory=/root/asd
 ioengine=libaio
 bs=4k
 #numjobs=8
 iodepth=64


 Br,T
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of Mark Nelson
 Sent: 30. kesäkuuta 2015 20:55
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

 Hi Tuomos,

 Can you paste the command you ran to do the test?

 Thanks,
 Mark

 On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:
 Hi

 It’s not probably hitting the disks, but that really doesn’t matter.
 The point is we have very responsive VM’s while writing and that is 
 what the users will see.

 The iops we get with sequential read is good, but the random read is 
 way too low.

 Is using SSD’s as OSD’s the only way to get it up? or is there some 
 tunable which would enhance it? I would assume Linux caches reads in 
 memory and serves them from there, but atleast now we don’t see it.

 Br,

 Tuomas

 *From:*Somnath Roy [mailto:somnath@sandisk.com]
 *Sent:* 30. kesäkuuta 2015 19:24
 *To:* Tuomas Juntunen; 'ceph-users'
 *Subject:* RE: [ceph-users] Very low 4k randread performance 
 ~1000iops

 Break it down, try fio-rbd to see what is the performance you getting..

 But, I am really surprised you are getting  100k iops for write, 
 did you check it is hitting the disks ?

 Thanks  Regards

 Somnath

 *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
 Behalf Of *Tuomas Juntunen
 *Sent:* Tuesday, June 30, 2015 8:33 AM
 *To:* 'ceph-users'
 *Subject:* [ceph-users] Very low 4k randread performance ~1000iops

 Hi

 I have been trying to figure out why our 4k random reads in VM’s are 
 so bad. I am using fio to test this.

 Write : 170k iops

 Random write : 109k iops

 Read : 64k iops

 Random read : 1k iops

 Our setup is:

 3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 
 64gb mem  2x6core cpu’s

 4 monitors running on other servers

 40gbit infiniband with IPoIB

 Openstack : Qemu-kvm for virtuals

 Any help would be appreciated

 Thank you in advance.

 Br,

 Tuomas

 
 -
 -
 --


 PLEASE NOTE: The information contained in this electronic mail 
 message is intended only for the use of the designated recipient(s) 
 named
 above.
 If the reader of this message is not the intended recipient, you are 
 hereby notified that you have received this message in error and 
 that any review, dissemination, distribution, or copying of this 
 message is strictly prohibited. If you have received this 
 communication in error, please notify the sender by telephone or 
 e-mail (as shown
 above) immediately and destroy any and all copies of this message in 
 your possession (whether hard copies or electronically stored copies).



 

[ceph-users] file/directory invisible through ceph-fuse

2015-07-01 Thread flisky

Hi list,

I've run into a strange problem:

sometimes I cannot see a file/directory created by another ceph-fuse 
client. It becomes visible after I touch/mkdir the same name.


Any thoughts?

Thanks!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Round-trip time for monitors

2015-07-01 Thread - -
Hi everybody,

We have 3 monitors in our ceph cluster: 2 in one local site (2 data centers a
few km away from each other), and the 3rd one on a remote site, with a maximum
round-trip time (RTT) of 30ms between the local site and the remote site. All
OSDs run on the local site. The reason for the remote monitor is to keep the
cluster running if any DC fails.

Is that a valid configuration? What is the maximum acceptable RTT in such a Ceph
cluster?


Here are some details about our running cluster:
Current monmap:
---
epoch 4
fsid ...
last_changed 2015-05-12 08:39:35.600843
created 0.00
0: IP addr local0:6789/0 mon.local0
1: IP addr local1:6789/0 mon.local1
2: IP addr remote:6789/0 mon.remote
---

In our running cluster, the mon logs show that the leader monitor is on the
local site, while the other two are peons.

Being curious, I increased runtime log-level debug settings for a few subsystems
(ms, mon, paxos...) to see if there was some kind of heartbeat between the
monitors. I noticed messages such as these ones...
--
2015-07-01 07:01:05.840845 7fd569bbe700  1 -- IP local1:6789/0 -- mon.0 IP
local0:6789/0 -- mon_health( service 1 op tell e 0 r 0 ) v1 -- ?+0 0x3b9b200
2015-07-01 07:01:05.840871 7fd569bbe700 20 -- IP local1:6789/0 submit_message
mon_health( service 1 op tell e 0 r 0 ) v1 remote, IP local0:6789/0, have
pipe.
2015-07-01 07:01:05.840885 7fd569bbe700  1 -- IP local1:6789/0 -- mon.2 IP
remote:6789/0 -- mon_health( service 1 op tell e 0 r 0 ) v1 -- ?+0 0x3b98a00
2015-07-01 07:01:05.840894 7fd569bbe700 20 -- IP local1:6789/0 submit_message
mon_health( service 1 op tell e 0 r 0 ) v1 remote, IP remote:6789/0, have
pipe.
--
... but none which tells me what I want: the idea was to see if anybody could
complain about a high RTT, and to monitor that value. Any idea on how to do it ?

Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Round-trip time for monitors

2015-07-01 Thread Wido den Hollander
On 07/01/2015 09:38 AM, - - wrote:
 Hi everybody,
 
 We have 3 monitors in our ceph cluster: 2 in one local site (2 data centers a
 few km away from each other), and the 3rd one on a remote site, with a maximum
 round-trip time (RTT) of 30ms between the local site and the remote site. All
 OSDs run on the local site. The reason for the remote monitor is to keep the
 cluster running if any DC fails.
 
 Is that a valid configuration ? What is the maximum RTT valid in such a Ceph
 cluster ? 
 

Well, I think that 30ms is a bit high. You'll get into a clock-drift
situation a lot earlier.

The leading monitor uses its local time, sends out the packet, which
then arrives 15ms later at the other mon. For the monitors that is a
clock drift of 15ms at least.

Also, it could be that the monitor on the remote site is elected as
leader and that will cause all your OSD traffic to go via that Monitor.

In big changes in the cluster it will add at least 30ms of latency to
certain requests which slows down the cluster.
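
If you do keep the remote mon, you may also want to loosen the clock skew
warning a bit in ceph.conf, e.g. (sketch; mon_clock_drift_allowed defaults
to 0.05 s, i.e. 50 ms):

[mon]
    mon clock drift allowed = 0.1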

 
 Here are some details about our running cluster:
 Current monmap:
 ---
 epoch 4
 fsid ...
 last_changed 2015-05-12 08:39:35.600843
 created 0.00
 0: IP addr local0:6789/0 mon.local0
 1: IP addr local1:6789/0 mon.local1
 2: IP addr remote:6789/0 mon.remote
 ---
 
 In our running cluster, the mon logs show that the leader monitor is on the
 local site, while the other two are peons.
 
 Being curious, I increased runtime log-level debug settings for a few 
 subsystems
 (ms, mon, paxos...) to see if there was some kind of heartbeat between the
 monitors. I noticed messages such as these ones...
 --
 2015-07-01 07:01:05.840845 7fd569bbe700  1 -- IP local1:6789/0 -- mon.0 IP
 local0:6789/0 -- mon_health( service 1 op tell e 0 r 0 ) v1 -- ?+0 0x3b9b200
 2015-07-01 07:01:05.840871 7fd569bbe700 20 -- IP local1:6789/0 
 submit_message
 mon_health( service 1 op tell e 0 r 0 ) v1 remote, IP local0:6789/0, have
 pipe.
 2015-07-01 07:01:05.840885 7fd569bbe700  1 -- IP local1:6789/0 -- mon.2 IP
 remote:6789/0 -- mon_health( service 1 op tell e 0 r 0 ) v1 -- ?+0 0x3b98a00
 2015-07-01 07:01:05.840894 7fd569bbe700 20 -- IP local1:6789/0 
 submit_message
 mon_health( service 1 op tell e 0 r 0 ) v1 remote, IP remote:6789/0, have
 pipe.
 --
 ... but none which tells me what I want: the idea was to see if anybody could
 complain about a high RTT, and to monitor that value. Any idea on how to do 
 it ?
 
 Thank you.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CDS Jewel Wed/Thurs

2015-07-01 Thread Zhou, Yuan
Hey Patrick, 

Looks like the GMT+8 time for the 1st day is wrong, should be 10:00 pm - 7:30 
am?

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Patrick McGarry
Sent: Tuesday, June 30, 2015 11:28 PM
To: Ceph Devel; Ceph-User
Subject: CDS Jewel Wed/Thurs

Hey cephers,

Just a friendly reminder that our Ceph Developer Summit for Jewel planning is 
set to run tomorrow and Thursday. The schedule and dial in information is 
available on the new wiki:

http://tracker.ceph.com/projects/ceph/wiki/CDS_Jewel

Please let me know if you have any questions. Thanks!


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com @scuttlemonkey || @ceph
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in the 
body of a message to majord...@vger.kernel.org More majordomo info at  
http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Round-trip time for monitors

2015-07-01 Thread Gregory Farnum
On Wed, Jul 1, 2015 at 8:38 AM, - - francois.pe...@san-services.com wrote:
 Hi everybody,

 We have 3 monitors in our ceph cluster: 2 in one local site (2 data centers a
 few km away from each other), and the 3rd one on a remote site, with a maximum
 round-trip time (RTT) of 30ms between the local site and the remote site. All
 OSDs run on the local site. The reason for the remote monitor is to keep the
 cluster running if any DC fails.

 Is that a valid configuration ? What is the maximum RTT valid in such a Ceph
 cluster ?


 Here are some details about our running cluster:
 Current monmap:
 ---
 epoch 4
 fsid ...
 last_changed 2015-05-12 08:39:35.600843
 created 0.00
 0: IP addr local0:6789/0 mon.local0
 1: IP addr local1:6789/0 mon.local1
 2: IP addr remote:6789/0 mon.remote
 ---

 In our running cluster, the mon logs show that the leader monitor is on the
 local site, while the other two are peons.

 Being curious, I increased runtime log-level debug settings for a few 
 subsystems
 (ms, mon, paxos...) to see if there was some kind of heartbeat between the
 monitors. I noticed messages such as these ones...
 --
 2015-07-01 07:01:05.840845 7fd569bbe700  1 -- IP local1:6789/0 -- mon.0 IP
 local0:6789/0 -- mon_health( service 1 op tell e 0 r 0 ) v1 -- ?+0 0x3b9b200
 2015-07-01 07:01:05.840871 7fd569bbe700 20 -- IP local1:6789/0 
 submit_message
 mon_health( service 1 op tell e 0 r 0 ) v1 remote, IP local0:6789/0, have
 pipe.
 2015-07-01 07:01:05.840885 7fd569bbe700  1 -- IP local1:6789/0 -- mon.2 IP
 remote:6789/0 -- mon_health( service 1 op tell e 0 r 0 ) v1 -- ?+0 0x3b98a00
 2015-07-01 07:01:05.840894 7fd569bbe700 20 -- IP local1:6789/0 
 submit_message
 mon_health( service 1 op tell e 0 r 0 ) v1 remote, IP remote:6789/0, have
 pipe.
 --
 ... but none which tells me what I want: the idea was to see if anybody could
 complain about a high RTT, and to monitor that value. Any idea on how to do 
 it ?

I don't think there's anything that monitors RTT directly, but 30 ms
shouldn't be a problem; that's an order of magnitude or more below all
the various timeout thresholds. The clock skew detection might need to
be loosened up but I'm not very familiar with how that bit works, and
it's not quite as crucial anyway. :)
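A trivial way to keep an eye on the RTT from outside Ceph is to sample it
yourself on each local mon host, e.g. with a cron entry like this (sketch;
mon-remote is a placeholder for the remote mon's hostname):

* * * * * ping -c 3 -q mon-remote | tail -1 >> /var/log/mon-rtt.log

The last line of ping's summary gives you min/avg/max RTT, which is easy to
graph or alert on.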
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] file/directory invisible through ceph-fuse

2015-07-01 Thread Gregory Farnum
On Wed, Jul 1, 2015 at 9:02 AM, flisky yinjif...@lianjia.com wrote:
 Hi list,

 I meet a strange problem:

 sometimes I cannot see the file/directory created by another ceph-fuse
 client. It comes into visible after I touch/mkdir the same name.

 Any thoughts?

What version are you running? We've seen a few things like this with
older releases, although usually it's in the kernel...
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One of our nodes has logs saying: wrongly marked me down

2015-07-01 Thread Tuomas Juntunen
I’ve checked the network, we use IPoIB and all nodes are connected to the
same switch, there are no breaks in connectivity while this happens. My
constant ping says 0.03 – 0.1ms. I would say this is ok.

 

This happens almost every time deep scrubbing is running. Our load on
this particular server goes to 300+ and OSDs are marked down.

 

Any suggestions on settings? I now have the following settings that might
affect this

 

[global]
 osd_op_threads = 6
 osd_op_num_threads_per_shard = 1
 osd_op_num_shards = 25
 #osd_op_num_sharded_pool_threads = 25
 filestore_op_threads = 6
 ms_nocrc = true
 filestore_fd_cache_size = 64
 filestore_fd_cache_shards = 32
 ms_dispatch_throttle_bytes = 0
 throttler_perf_counter = false

[osd]
 osd scrub load threshold = 0.1
 osd max backfills = 1
 osd recovery max active = 1
 osd scrub sleep = .1
 osd disk thread ioprio class = idle
 osd disk thread ioprio priority = 7
 osd scrub chunk max = 5
 osd deep scrub stride = 1048576
 filestore queue max ops = 1
 filestore max sync interval = 30
 filestore min sync interval = 29
 osd_client_message_size_cap = 0
 osd_client_message_cap = 0
 osd_enable_op_tracker = false

 

Br, T

 

 

From: Somnath Roy [mailto:somnath@sandisk.com] 
Sent: 2. heinäkuuta 2015 0:30
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked
me down

 

This can happen if your OSDs are flapping.. Hope your network is stable.

 

Thanks  Regards

Somnath

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Tuomas Juntunen
Sent: Wednesday, July 01, 2015 2:24 PM
To: 'ceph-users'
Subject: [ceph-users] One of our nodes has logs saying: wrongly marked me
down

 

Hi

 

One our nodes has OSD logs that say “wrongly marked me down” for every OSD
at some point. What could be the reason for this. Anyone have any similar
experiences?

 

Other nodes work totally fine and they are all identical.

 

Br,T

 

  _  


PLEASE NOTE: The information contained in this electronic mail message is
intended only for the use of the designated recipient(s) named above. If the
reader of this message is not the intended recipient, you are hereby
notified that you have received this message in error and that any review,
dissemination, distribution, or copying of this message is strictly
prohibited. If you have received this communication in error, please notify
the sender by telephone or e-mail (as shown above) immediately and destroy
any and all copies of this message in your possession (whether hard copies
or electronically stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One of our nodes has logs saying: wrongly marked me down

2015-07-01 Thread Somnath Roy
Yeah, this can happen during deep_scrub and also during rebalancing..I forgot 
to mention that..
Generally, it is a good idea to throttle those..For deep scrub, you can try 
using (got it from old post, I never used it)


osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
osd_scrub_sleep = 0.1
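
These can also be tried on a running cluster without restarting the OSDs,
e.g. (sketch, run from a node that has the admin keyring):

ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_scrub_chunk_max 1'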



For rebalancing I think you are already using proper value..



But, I don't think this will eliminate the scenario all together but should 
alleviate it a bit.



Also, why are you using so many shards? How many OSDs are you running in a box? 
25 shards should be good if you are running a single OSD; if you have a lot 
of OSDs in a box, try to reduce it to ~5 or so.



Thanks  Regards

Somnath


From: Tuomas Juntunen [mailto:tuomas.juntu...@databasement.fi]
Sent: Wednesday, July 01, 2015 8:18 PM
To: Somnath Roy; 'ceph-users'
Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me 
down

I've checked the network, we use IPoIB and all nodes are connected to the same 
switch, there are no breaks in connectivity while this happens. My constant 
ping says 0.03 - 0.1ms. I would say this is ok.

This happens almost every time when deep scrubbing is running. Our loads on 
this particular server goes to 300+ and osd's are marked down.

Any suggestions on settings? I now have the following settings that might 
affect this

[global]
 osd_op_threads = 6
 osd_op_num_threads_per_shard = 1
 osd_op_num_shards = 25
 #osd_op_num_sharded_pool_threads = 25
 filestore_op_threads = 6
 ms_nocrc = true
 filestore_fd_cache_size = 64
 filestore_fd_cache_shards = 32
 ms_dispatch_throttle_bytes = 0
 throttler_perf_counter = false

[osd]
 osd scrub load threshold = 0.1
 osd max backfills = 1
 osd recovery max active = 1
 osd scrub sleep = .1
 osd disk thread ioprio class = idle
 osd disk thread ioprio priority = 7
 osd scrub chunk max = 5
 osd deep scrub stride = 1048576
 filestore queue max ops = 1
 filestore max sync interval = 30
 filestore min sync interval = 29
 osd_client_message_size_cap = 0
 osd_client_message_cap = 0
 osd_enable_op_tracker = false

Br, T


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: 2. heinäkuuta 2015 0:30
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me 
down

This can happen if your OSDs are flapping.. Hope your network is stable.

Thanks  Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas 
Juntunen
Sent: Wednesday, July 01, 2015 2:24 PM
To: 'ceph-users'
Subject: [ceph-users] One of our nodes has logs saying: wrongly marked me down

Hi

One our nodes has OSD logs that say wrongly marked me down for every OSD at 
some point. What could be the reason for this. Anyone have any similar 
experiences?

Other nodes work totally fine and they are all identical.

Br,T



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread German Anders
I would probably go with smaller OSD disks; 4TB is too much to lose in
case of a broken disk, so maybe more OSD daemons of a smaller size, maybe 1TB
or 2TB each. A 4:1 relationship is good enough, and I also think that a 200G disk
for the journals would be OK, so you can save some money there. The OSDs
should of course be configured as a JBOD, don't use any RAID under them, and use
two different networks for the public and cluster nets.
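
As a rough sanity check, the filestore guideline from the docs is

    osd journal size = 2 * (expected throughput * filestore max sync interval)

so with a ~150 MB/s spinner and the default 5 s sync interval that works out
to roughly 1.5 GB per OSD journal. Even a 200G SSD shared by 4 OSDs leaves a
lot of headroom; the unused space can simply stay unprovisioned, which also
helps the SSD's endurance.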

*German*

2015-07-01 18:49 GMT-03:00 Nate Curry cu...@mosaicatm.com:

 I would like to get some clarification on the size of the journal disks
 that I should get for my new Ceph cluster I am planning.  I read about the
 journal settings on
 http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings
 but that didn't really clarify it for me that or I just didn't get it.  I
 found in the Learning Ceph Packt book it states that you should have one
 disk for journalling for every 4 OSDs.  Using that as a reference I was
 planning on getting multiple systems with 8 x 6TB inline SAS drives for
 OSDs with two SSDs for journalling per host as well as 2 hot spares for the
 6TB drives and 2 drives for the OS.  I was thinking of 400GB SSD drives but
 am wondering if that is too much.  Any informed opinions would be
 appreciated.

 Thanks,

 *Nate Curry*


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

2015-07-01 Thread Jan Schermer
Re: your previous question

I will not elaborate on this much more, I hope some of you will try it if you 
have NUMA systems and see for yourself.

But I can recommend some docs:
http://globalsp.ts.fujitsu.com/dmsp/Publications/public/wp-ivy-bridge-ep-memory-performance-ww-en.pdf
 
http://globalsp.ts.fujitsu.com/dmsp/Publications/public/wp-ivy-bridge-ep-memory-performance-ww-en.pdf

http://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf 
http://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf

RHEL also has some nice documentation on the issue. If you don’t use ancient 
(like RHEL6) systems then your OS+kernel should do the “right thing” by default 
and take NUMA locality into account when scheduling and migrating.
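
For anyone who wants to try the pinning by hand before using a script, the
basic moves look something like this (sketch; it assumes a two-socket box,
cgroup v1 cpusets mounted under /sys/fs/cgroup/cpuset, and the pid files the
OSDs already write under /var/run/ceph):

# start an OSD bound to NUMA node 0 (CPUs and memory):
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf

# or restrict an already running OSD (all of its threads) to cores 0-5:
taskset -acp 0-5 $(cat /var/run/ceph/osd.0.pid)

# or the same with a cpuset cgroup, which also pins the memory node:
mkdir /sys/fs/cgroup/cpuset/osd0
echo 0-5 > /sys/fs/cgroup/cpuset/osd0/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/osd0/cpuset.mems
echo $(cat /var/run/ceph/osd.0.pid) > /sys/fs/cgroup/cpuset/osd0/cgroup.procs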

Jan


 On 01 Jul 2015, at 03:02, Ray Sun xiaoq...@gmail.com wrote:
 
 Jan,
 Thanks a lot. I can do my contribution to this project if I can.
 
 Best Regards
 -- Ray
 
 On Tue, Jun 30, 2015 at 11:50 PM, Jan Schermer j...@schermer.cz 
 mailto:j...@schermer.cz wrote:
 Hi all,
 our script is available on GitHub
 
 https://github.com/prozeta/pincpus https://github.com/prozeta/pincpus
 
 I haven’t had much time to do a proper README, but I hope the configuration 
 is self explanatory enough for now.
 What it does is pin each OSD into the most “empty” cgroup assigned to a NUMA 
 node.
 
 Let me know how it works for you!
 
 Jan
 
 
 On 30 Jun 2015, at 10:50, Huang Zhiteng winsto...@gmail.com 
 mailto:winsto...@gmail.com wrote:
 
 
 
 On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer j...@schermer.cz 
 mailto:j...@schermer.cz wrote:
 Not having OSDs and KVMs compete against each other is one thing.
 But there are more reasons to do this
 
 1) not moving the processes and threads between cores that much (better 
 cache utilization)
 2) aligning the processes with memory on NUMA systems (that means all modern 
 dual socket systems) - you don’t want your OSD running on CPU1 with memory 
 allocated to CPU2
 3) the same goes for other resources like NICs or storage controllers - but 
 that’s less important and not always practical to do
 4) you can limit the scheduling domain on linux if you limit the cpuset for 
 your OSDs (I’m not sure how important this is, just best practice)
 5) you can easily limit memory or CPU usage, set priority, with much greater 
 granularity than without cgroups
 6) if you have HyperThreading enabled you get the most gain when the 
 workloads on the threads are dissimiliar - so to have the higher throughput 
 you have to pin OSD to thread1 and KVM to thread2 on the same core. We’re 
 not doing that because latency and performance of the core can vary 
 depending on what the other thread is doing. But it might be useful to 
 someone.
 
 Some workloads exhibit 100% performance gain when everything aligns in a 
 NUMA system, compared to a SMP mode on the same hardware. You likely won’t 
 notice it on light workloads, as the interconnects (QPI) are very fast and 
 there’s a lot of bandwidth, but for stuff like big OLAP databases or other 
 data-manipulation workloads there’s a huge difference. And with CEPH being 
 CPU hungy and memory intensive, we’re seeing some big gains here just by 
 co-locating the memory with the processes….
 Could you elaborate a it on this?  I'm interested to learn in what situation 
 memory locality helps Ceph to what extend. 
 
 
 Jan
 
  
 On 30 Jun 2015, at 08:12, Ray Sun xiaoq...@gmail.com 
 mailto:xiaoq...@gmail.com wrote:
 
 ​Sound great, any update please let me know.​
 
 Best Regards
 -- Ray
 
 On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer j...@schermer.cz 
 mailto:j...@schermer.cz wrote:
 I promised you all our scripts for automatic cgroup assignment - they are 
 in our production already and I just need to put them on github, stay tuned 
 tomorrow :-)
 
 Jan
 
 
 On 29 Jun 2015, at 19:41, Somnath Roy somnath@sandisk.com 
 mailto:somnath@sandisk.com wrote:
 
 Presently, you have to do it by using tool like ‘taskset’ or ‘numactl’…
  
 Thanks  Regards
 Somnath
  
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
 mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ray Sun
 Sent: Monday, June 29, 2015 9:19 AM
 To: ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 Subject: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu 
 core?
  
 Cephers,
 I want to bind each of my ceph-osd to a specific cpu core, but I didn't 
 find any document to explain that, could any one can provide me some 
 detailed information. Thanks.
  
 Currently, my ceph is running like this:
  
 oot  28692  1  0 Jun23 ?00:37:26 /usr/bin/ceph-mon -i 
 seed.econe.com http://seed.econe.com/ --pid-file 
 /var/run/ceph/mon.seed.econe.com.pid -c /etc/ceph/ceph.conf --cluster ceph
 root  40063  1  1 Jun23 ?02:13:31 /usr/bin/ceph-osd -i 0 
 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph
 root  42096  1  0 Jun23 ?01:33:42 /usr/bin/ceph-osd -i 

Re: [ceph-users] xattrs vs omap

2015-07-01 Thread Christian Balzer

Hello,

On Wed, 1 Jul 2015 15:24:13 + Somnath Roy wrote:

 It doesn't matter, I think filestore_xattr_use_omap is a 'noop'  and not
 used in the Hammer.
 
Then what was this functionality replaced with, esp. considering EXT4
based OSDs?
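
(For what it's worth, you can at least check what the OSDs actually ended up
running with via the admin socket, e.g.:

ceph daemon osd.0 config show | grep xattr

which should list the filestore xattr/omap related options and their current
values.)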

Chibi
 Thanks  Regards
 Somnath
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Adam Tygart Sent: Wednesday, July 01, 2015 8:20 AM
 To: Ceph Users
 Subject: [ceph-users] xattrs vs omap
 
 Hello all,
 
 I've got a coworker who put filestore_xattr_use_omap = true in the
 ceph.conf when we first started building the cluster. Now he can't
 remember why. He thinks it may be a holdover from our first Ceph cluster
 (running dumpling on ext4, iirc).
 
 In the newly built cluster, we are using XFS with 2048 byte inodes,
 running Ceph 0.94.2. It currently has production data in it.
 
 From my reading of other threads, it looks like this is probably not
 something you want set to true (at least on XFS), due to performance
 implications. Is this something you can change on a running cluster? Is
 it worth the hassle?
 
 Thanks,
 Adam
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 PLEASE NOTE: The information contained in this electronic mail message
 is intended only for the use of the designated recipient(s) named above.
 If the reader of this message is not the intended recipient, you are
 hereby notified that you have received this message in error and that
 any review, dissemination, distribution, or copying of this message is
 strictly prohibited. If you have received this communication in error,
 please notify the sender by telephone or e-mail (as shown above)
 immediately and destroy any and all copies of this message in your
 possession (whether hard copies or electronically stored copies).
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread Shane Gibson

It also depends a lot on the size of your cluster ... I have a test cluster I'm 
standing up right now with 60 nodes - a total of 600 OSDs each at 4 TB ... If I 
lose 4 TB - that's a very small fraction of the data.  My replicas are going to 
be spread out across a lot of spindles, and replicating that missing 4 TB isn't 
much of an issue, across 3 racks each with 80 gbit/sec ToR uplinks to Spine.  
Each node has 20 gbit/sec to ToR in a bond.

On the other hand ... if you only have 4 .. or 8 ... or 10 servers ... and a 
smaller number of OSDs - you have fewer spindles replicating that loss, and it 
might be more of an issue.

It just depends on the size/scale of  your environment.

We're going to 8 TB drives - and that will ultimately be spread over a 100 or 
more physical servers w/ 10 OSD disks per server.   This will be across 7 to 10 
racks (same network topology) ... so an 8 TB drive loss isn't too big of an 
issue.   Now that assumes that replication actually works well in that size 
cluster.  We're still sussing out this part of the PoC engagement.

~~shane



On 7/1/15, 5:05 PM, ceph-users on behalf of German Anders 
ceph-users-boun...@lists.ceph.commailto:ceph-users-boun...@lists.ceph.com on 
behalf of gand...@despegar.commailto:gand...@despegar.com wrote:

ask the other guys on the list, but for me to lose 4TB of data is to much, the 
cluster will still running fine, but in some point you need to recover that 
disk, and also if you lose one server with all the 4TB disk in that case yeah 
it will hurt the cluster, also take into account that with that kind of disk 
you will get no more than 100-110 iops per disk


German Anders
Storage System Engineer Leader
Despegar | IT Team
office +54 11 4894 3500 x3408
mobile +54 911 3493 7262
mail gand...@despegar.commailto:gand...@despegar.com

2015-07-01 20:54 GMT-03:00 Nate Curry 
cu...@mosaicatm.commailto:cu...@mosaicatm.com:

4TB is too much to lose?  Why would it matter if you lost one 4TB with the 
redundancy?  Won't it auto recover from the disk failure?

Nate Curry

On Jul 1, 2015 6:12 PM, German Anders 
gand...@despegar.commailto:gand...@despegar.com wrote:
I would probably go with less size osd disks, 4TB is to much to loss in case of 
a broken disk, so maybe more osd daemons with less size, maybe 1TB or 2TB size. 
4:1 relationship is good enough, also i think that 200G disk for the journals 
would be ok, so you can save some money there, the osd's of course configured 
them as a JBOD, don't use any RAID under it, and use two different networks for 
public and cluster net.


German

2015-07-01 18:49 GMT-03:00 Nate Curry 
cu...@mosaicatm.commailto:cu...@mosaicatm.com:
I would like to get some clarification on the size of the journal disks that I 
should get for my new Ceph cluster I am planning.  I read about the journal 
settings on 
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings
 but that didn't really clarify it for me that or I just didn't get it.  I 
found in the Learning Ceph Packt book it states that you should have one disk 
for journalling for every 4 OSDs.  Using that as a reference I was planning on 
getting multiple systems with 8 x 6TB inline SAS drives for OSDs with two SSDs 
for journalling per host as well as 2 hot spares for the 6TB drives and 2 
drives for the OS.  I was thinking of 400GB SSD drives but am wondering if that 
is too much.  Any informed opinions would be appreciated.

Thanks,

Nate Curry


___
ceph-users mailing list
ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread German Anders
I'm interested in such a configuration; can you share some performance
tests/numbers?

Thanks in advance,

Best regards,

*German*

2015-07-01 21:16 GMT-03:00 Shane Gibson shane_gib...@symantec.com:


 It also depends a lot on the size of your cluster ... I have a test
 cluster I'm standing up right now with 60 nodes - a total of 600 OSDs each
 at 4 TB ... If I lose 4 TB - that's a very small fraction of the data.  My
 replicas are going to be spread out across a lot of spindles, and
 replicating that missing 4 TB isn't much of an issue, across 3 racks each
 with 80 gbit/sec ToR uplinks to Spine.  Each node has 20 gbit/sec to ToR in
 a bond.

 On the other hand ... if you only have 4 .. or 8 ... or 10 servers ... and
 a smaller number of OSDs - you have fewer spindles replicating that loss,
 and it might be more of an issue.

 It just depends on the size/scale of  your environment.

 We're going to 8 TB drives - and that will ultimately be spread over a 100
 or more physical servers w/ 10 OSD disks per server.   This will be across
 7 to 10 racks (same network topology) ... so an 8 TB drive loss isn't too
 big of an issue.   Now that assumes that replication actually works well in
 that size cluster.  We're still cessing out this part of the PoC
 engagement.

 ~~shane




 On 7/1/15, 5:05 PM, ceph-users on behalf of German Anders 
 ceph-users-boun...@lists.ceph.com on behalf of gand...@despegar.com
 wrote:

 ask the other guys on the list, but for me to lose 4TB of data is to much,
 the cluster will still running fine, but in some point you need to recover
 that disk, and also if you lose one server with all the 4TB disk in that
 case yeah it will hurt the cluster, also take into account that with that
 kind of disk you will get no more than 100-110 iops per disk

 *German Anders*
 Storage System Engineer Leader
 *Despegar* | IT Team
 *office* +54 11 4894 3500 x3408
 *mobile* +54 911 3493 7262
 *mail* gand...@despegar.com

 2015-07-01 20:54 GMT-03:00 Nate Curry cu...@mosaicatm.com:

 4TB is too much to lose?  Why would it matter if you lost one 4TB with
 the redundancy?  Won't it auto recover from the disk failure?

 Nate Curry
 On Jul 1, 2015 6:12 PM, German Anders gand...@despegar.com wrote:

 I would probably go with less size osd disks, 4TB is to much to loss in
 case of a broken disk, so maybe more osd daemons with less size, maybe 1TB
 or 2TB size. 4:1 relationship is good enough, also i think that 200G disk
 for the journals would be ok, so you can save some money there, the osd's
 of course configured them as a JBOD, don't use any RAID under it, and use
 two different networks for public and cluster net.

 *German*

 2015-07-01 18:49 GMT-03:00 Nate Curry cu...@mosaicatm.com:

 I would like to get some clarification on the size of the journal disks
 that I should get for my new Ceph cluster I am planning.  I read about the
 journal settings on
 http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings
 but that didn't really clarify it for me that or I just didn't get it.  I
 found in the Learning Ceph Packt book it states that you should have one
 disk for journalling for every 4 OSDs.  Using that as a reference I was
 planning on getting multiple systems with 8 x 6TB inline SAS drives for
 OSDs with two SSDs for journalling per host as well as 2 hot spares for the
 6TB drives and 2 drives for the OS.  I was thinking of 400GB SSD drives but
 am wondering if that is too much.  Any informed opinions would be
 appreciated.

 Thanks,

 *Nate Curry*


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com