Re: [ceph-users] client did not provide supported auth type

2016-06-27 Thread Goncalo Borges

Hi...

Just to clarify, you can have just one, but if that one becomes problematic 
then your cluster stops working. It is always better to have more than 
one, and in odd numbers: 3, 5, ...


Regarding your specific problem, I am guessing it is related to keys and 
permissions, because of the 'client did not provide supported auth type' 
message.


Check the status of your client admin key and its permissions with 
'ceph auth list'. It should look something like:


   client.admin
key: XX
auid: 0
caps: [mon] allow *
caps: [osd] allow *
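
If the key or the keyring file on the client does not match what the mons 
have, re-exporting the admin key usually sorts it out. A minimal sketch, 
assuming the default cluster name 'ceph' and default paths:

    # on a node where the admin key works
    ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
    # copy that file to /etc/ceph/ on the problematic client, then retest
    ceph -s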

Cheers
G


On 06/28/2016 01:40 PM, Goncalo Borges wrote:


Hi XiuCai

Shouldn't you have, at least, 2 mons?

Cheers

G.


On 06/28/2016 01:12 PM, 秀才 wrote:

Hi,

there are 1 mon and 7 osds in my cluster now.
but it seems something is wrong, because `rbd -p test create pet --size 
1024` does not return.

and the status is always as below:

cluster 41f3f57f-0ca8-4dac-ba10-9359043ae21a
 health HEALTH_WARN
256 pgs degraded
256 pgs stuck degraded
256 pgs stuck inactive
256 pgs stuck unclean
256 pgs stuck undersized
256 pgs undersized
 monmap e1: 1 mons at {a=192.168.1.101:6789/0}
election epoch 2, quorum 0 a
 osdmap e32: 7 osds: 7 up, 7 in
  pgmap v69: 256 pgs, 1 pools, 0 bytes data, 0 objects
231 MB used, 1442 GB / 1442 GB avail
 256 undersized+degraded+peered

I checked the monitor's log:

1 mon.a@0(leader).log v2235 check_sub sending message to client.? 
192.168.1.101:0/2395453635 with 0 entries (version 2235)

1 mon.a@0(leader).auth v27 client did not provide supported auth type
1 mon.a@0(leader).auth v27 client did not provide supported auth type

What is the meaning of v2235 & v27 here?
How do I solve this problem?

Regards,
XiuCai.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937


--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW AWS4 SignatureDoesNotMatch when requests with port != 80 or != 443

2016-06-27 Thread Khang Nguyễn Nhật
Thanks Javier Muñoz. I will look into it.

2016-06-24 22:30 GMT+07:00 Javier Muñoz :

> Hi Khang,
>
> Today I had a look in a very similar issue...
>
> http://tracker.ceph.com/issues/16463
>
> I guess it could be the same bug you hit. I added some info in the
> ticket. Feel free to comment there.
>
> Thanks,
> Javier
>
> On 06/05/2016 04:17 PM, Khang Nguyễn Nhật wrote:
> > Hi!
> > I get the error "SignatureDoesNotMatch" when I use a
> > presigned URL with endpoint port != 80 and != 443. For example, if I use
> > host http://192.168.1.1: then this is what I have in the RGW log:
> > //
> > RGWEnv::set(): HTTP_HOST: 192.168.1.1:
> > //
> > RGWEnv::set(): SERVER_PORT: 
> > //
> > HTTP_HOST=192.168.1.1:
> > //
> > SERVER_PORT=
> > //
> > host=192.168.1.1
> > //
> > canonical headers format = host:192.168.1.1::
> > //
> > canonical request = GET
> > /
> >
> X-Amz-Algorithm=AWS4-HMAC-SHA256=%2F20160605%2Fap%2Fs3%2Faws4_request=20160605T125927Z=3600=host
> > host:192.168.1.1::
> >
> > host
> > UNSIGNED-PAYLOAD
> > //
> > - Verifying signatures
> > //
> > failed to authorize request
> > //
> >
> >
> > I see this in src/rgw/rgw_rest_s3.cc:
> > int RGW_Auth_S3::authorize_v4() {
> > //
> >   string port = s->info.env->get("SERVER_PORT", "");
> >   string secure_port = s->info.env->get("SERVER_PORT_SECURE", "");
> > //
> > if (using_qs && (token == "host")) {
> >   if (!port.empty() && port != "80") {
> > token_value = token_value + ":" + port;
> >   } else if (!secure_port.empty() && secure_port != "443") {
> > token_value = token_value + ":" + secure_port;
> >   }
> > }
> >
> > Is this caused by a mistake on my side? Can somebody please help me out?
> > Thanks!
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client did not provide supported auth type

2016-06-27 Thread Goncalo Borges

Hi XiuCai

Shouldn't you have, at least, 2 mons?

Cheers

G.


On 06/28/2016 01:12 PM, 秀才 wrote:

Hi,

there are 1 mon and 7 osds in my cluster now.
but it seems something is wrong, because `rbd -p test create pet --size 
1024` does not return.

and the status is always as below:

cluster 41f3f57f-0ca8-4dac-ba10-9359043ae21a
 health HEALTH_WARN
256 pgs degraded
256 pgs stuck degraded
256 pgs stuck inactive
256 pgs stuck unclean
256 pgs stuck undersized
256 pgs undersized
 monmap e1: 1 mons at {a=192.168.1.101:6789/0}
election epoch 2, quorum 0 a
 osdmap e32: 7 osds: 7 up, 7 in
  pgmap v69: 256 pgs, 1 pools, 0 bytes data, 0 objects
231 MB used, 1442 GB / 1442 GB avail
 256 undersized+degraded+peered

I checked the monitor's log:

1 mon.a@0(leader).log v2235 check_sub sending message to client.? 
192.168.1.101:0/2395453635 with 0 entries (version 2235)

1 mon.a@0(leader).auth v27 client did not provide supported auth type
1 mon.a@0(leader).auth v27 client did not provide supported auth type

What is the meaning of v2235 & v27 here?
How do I solve this problem?

Regards,
XiuCai.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] client did not provide supported auth type

2016-06-27 Thread 秀才
Hi,


there are 1 mon and 7 osds in my cluster now.
but it seems something is wrong, because `rbd -p test create pet --size 1024` does 
not return.
and the status is always as below:


cluster 41f3f57f-0ca8-4dac-ba10-9359043ae21a
 health HEALTH_WARN
256 pgs degraded
256 pgs stuck degraded
256 pgs stuck inactive
256 pgs stuck unclean
256 pgs stuck undersized
256 pgs undersized
 monmap e1: 1 mons at {a=192.168.1.101:6789/0}
election epoch 2, quorum 0 a
 osdmap e32: 7 osds: 7 up, 7 in
  pgmap v69: 256 pgs, 1 pools, 0 bytes data, 0 objects
231 MB used, 1442 GB / 1442 GB avail
 256 undersized+degraded+peered



I checked the monitor's log:


1 mon.a@0(leader).log v2235 check_sub sending message to client.? 
192.168.1.101:0/2395453635 with 0 entries (version 2235)
1 mon.a@0(leader).auth v27 client did not provide supported auth type
1 mon.a@0(leader).auth v27 client did not provide supported auth type



What is the meaning of v2235 & v27 here?
How do I solve this problem?


Regards,
XiuCai.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mon.target and ceph-mds.target systemd dependencies in centos7

2016-06-27 Thread Goncalo Borges

Hi All...

Just upgraded from infernalis 9.2.0 to jewel 10.2.2 in centos7.

I do have a question regarding ceph-mon.target and ceph-mds.target 
systemd dependencies.


Before the upgrade, I had the following situation on a mon host, a data 
host with 8 osds, and an mds host:



   # systemctl list-dependencies ceph.target
   ceph.target
   ● └─ceph-mon@rccephmon1.service


   # systemctl list-dependencies ceph.target
   ceph.target
   ● ├─ceph-osd@0.service
   ● ├─ceph-osd@1.service
   ● ├─ceph-osd@2.service
   ● ├─ceph-osd@3.service
   ● ├─ceph-osd@4.service
   ● ├─ceph-osd@5.service
   ● ├─ceph-osd@6.service
   ● └─ceph-osd@7.service


   # systemctl list-dependencies ceph.target
   ceph.target
   ● └─ceph-mds@rccephmds2.service


After upgrade (and reboot), I have:

   # systemctl list-dependencies ceph.target
   ceph.target
   ● ├─ceph-mon@rccephmon3.service
   ● ├─ceph-mds.target
   ● ├─ceph-mon.target
   ● └─ceph-osd.target


   # systemctl list-dependencies ceph.target
   ceph.target
   ● ├─ceph-osd@0.service
   ● ├─ceph-osd@1.service
   ● ├─ceph-osd@2.service
   ● ├─ceph-osd@3.service
   ● ├─ceph-osd@4.service
   ● ├─ceph-osd@5.service
   ● ├─ceph-osd@6.service
   ● ├─ceph-osd@7.service
   ● ├─ceph-mds.target
   ● ├─ceph-mon.target
   ● └─ceph-osd.target
   ●   ├─ceph-osd@0.service
   ●   ├─ceph-osd@1.service
   ●   ├─ceph-osd@2.service
   ●   ├─ceph-osd@3.service
   ●   ├─ceph-osd@4.service
   ●   ├─ceph-osd@5.service
   ●   ├─ceph-osd@6.service
   ●   └─ceph-osd@7.service

   # systemctl list-dependencies ceph.target
   ceph.target
   ● ├─ceph-mds@rccephmds2.service
   ● ├─ceph-mds.target
   ● ├─ceph-mon.target
   ● └─ceph-osd.target


I understand that ceph-{mon,osd,mds}.target are available on all hosts 
because the ceph metapackage pulls in all the ceph-{mon,osd,mds} rpms even if 
we do not want to run a given service on a host.


   # rpm -qR ceph
   ceph-osd = 1:10.2.2-0.el7
   ceph-mds = 1:10.2.2-0.el7
   ceph-mon = 1:10.2.2-0.el7
   binutils
   systemd
   rpmlib(FileDigests) <= 4.6.0-1
   rpmlib(PayloadFilesHavePrefix) <= 4.0-1
   rpmlib(CompressedFileNames) <= 3.0.4-1
   rpmlib(PayloadIsXz) <= 5.2-1

However, what I do not understand is why all the ceph-osd@X.service units were 
correctly set as dependencies of ceph-osd.target, but 
ceph-mds@rccephmds2.service was not set as a dependency of ceph-mds.target 
(and similarly, why ceph-mon@rccephmon3.service was not set as a 
dependency of ceph-mon.target).


The configuration of the unit seems fine but I do not understand why it 
is not applied (after systemctl daemon-reload or even after rebooting):


   # systemctl cat ceph-mon@rccephmon3.service
   # /usr/lib/systemd/system/ceph-mon@.service
   [Unit]
   Description=Ceph cluster monitor daemon

   # According to:
   #   http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget
   # these can be removed once ceph-mon will dynamically change network
   # configuration.
   After=network-online.target local-fs.target time-sync.target
   ceph-create-keys@%i.service
   Wants=network-online.target local-fs.target time-sync.target
   ceph-create-keys@%i.service

   PartOf=ceph-mon.target

   [Service]
   LimitNOFILE=1048576
   LimitNPROC=1048576
   EnvironmentFile=-/etc/sysconfig/ceph
   Environment=CLUSTER=ceph
   ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i
   --setuser ceph --setgroup ceph
   ExecReload=/bin/kill -HUP $MAINPID
   PrivateDevices=yes
   ProtectHome=true
   ProtectSystem=full
   PrivateTmp=true
   TasksMax=infinity
   Restart=on-failure
   StartLimitInterval=30min
   StartLimitBurst=3

   [Install]
   WantedBy=ceph-mon.target

Am I the only one seeing this issue? Is it really an issue?
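
In case it helps anyone else, explicitly (re-)enabling the instantiated
units does recreate the .wants symlinks the targets rely on. This is only a
sketch from my own hosts, not an official upgrade step:

   systemctl enable ceph-mon@rccephmon3
   systemctl enable ceph-mds@rccephmds2
   systemctl daemon-reload
   systemctl list-dependencies ceph-mon.target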

Cheers
G.


--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pinpointing performance bottleneck / would SSD journals help?

2016-06-27 Thread Christian Balzer

Hello,

On Mon, 27 Jun 2016 21:35:35 +0100 Nick Fisk wrote:
[snip]
> 
> You need to run iostat on the OSD nodes themselves and see what the disks
> are doing. You stated that they are doing ~180iops per disk, which
> suggests they are highly saturated and likely to be the cause of the
> problem. I'm guessing you will also see really high queue depths per
> disk, which normally is the cause of high latency.
>
This.
Which Bosun (never used it) should have shown you already, if it's
worth its salt.

Running atop (large window) on your OSD nodes should give you a very clear
picture, too. 
Including network usage (unlikely to be your problem, but your 1Gb/s links
will hurt you latency-wise).

I predict you'll see lots of red and near 100% utilization on your OSD
drives when your cluster is getting into trouble. 

> If you add SSD journals and a large amount of your IO is writes, then you
> may see an improvement. But you may also be at the point where you just
> need more disks to be able to provide the required performance.
> 
SSD journals will roughly double your IOPS, and since you're at best going
to write around 400MB/s due to your network bandwidth, you can get away
with fewer/smaller SSDs.
Two 200GB DC S3610s or one 400GB DC S3710 would do the trick.
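
As a rough sanity check on sizing, the usual rule of thumb from the docs
(journal size = 2 x expected throughput x filestore max sync interval) gives,
per OSD, assuming the default 5s sync interval and ~100MB/s per spinner:

   2 x 100 MB/s x 5 s = 1 GB of journal per OSD

so capacity is not the constraint on those SSDs, endurance and sustained
write speed are.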

Past that point you need to grow the cluster (more OSDs, of course with
SSD journals) and/or consider cache-tiering.
The latter can give you dramatic gains, but this very much depends on your
usage patterns and the size of your hot data.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Auto-Tiering

2016-06-27 Thread Christian Balzer

Hello,

On Mon, 27 Jun 2016 21:11:02 +0530 Rakesh Parkiti wrote:

> Hi All,
> Does CEPH support auto tiering?
> Thanks
> Rakesh Parkiti

Googling for "auto tiering ceph" would have answered that question.

In short, it depends on how you define auto tiering. 

Ceph cache tiering is more of a cache than full multi-level storage
tiering (it only has 2 levels).

But depending on configuration it can achieve similar improvements.
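
For reference, the basic wiring looks like this (a sketch only; the pool
names are placeholders and you still need to tune hit_set and target
size/dirty ratios before using it in anger):

   ceph osd tier add cold-pool hot-pool
   ceph osd tier cache-mode hot-pool writeback
   ceph osd tier set-overlay cold-pool hot-pool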

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph not replicating to all osds

2016-06-27 Thread Christian Balzer

Hello,

On Mon, 27 Jun 2016 17:00:42 +0200 Ishmael Tsoaela wrote:

> Hi ALL,
> 
> Anyone can help with this issue would be much appreciated.
>
Your subject line has nothing to do with your "problem".

You're alluding to OSD replication problems, obviously assuming that one
client writes to OSD A and the other client reads from OSD B.
Which is not how Ceph works, but again, that's not your problem.
 
> I have created an  image on one client and mounted it on both 2 client I
> have setup.
> 
Details missing, but it's pretty obvious that you created a plain FS like
Ext4 on that image.

> When I write data on one client, I cannot access the data on another
> client, what could be causing this issue?
> 
This has cropped up here frequently: you're confusing replicated BLOCK
storage like RBD or DRBD with shared file systems like NFS or CephFS.

EXT4 and other normal filesystems can't do that, and you have just corrupted
the FS on that image.

So either use CephFS or run OCFS2/GFS2 on your shared image and clients.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph not replicating to all osds

2016-06-27 Thread Brad Hubbard
On Tue, Jun 28, 2016 at 1:00 AM, Ishmael Tsoaela  wrote:
> Hi ALL,
>
> Anyone can help with this issue would be much appreciated.
>
> I have created an  image on one client and mounted it on both 2 client I
> have setup.
>
> When I write data on one client, I cannot access the data on another client,
> what could be causing this issue?

I suspect you are talking about files showing up in a filesystem on
the rbd image you have
mounted on both clients? If so, you need to verify the chosen
filesystem supports that.

Let me know if I got this wrong (please provide a more detailed
description), or if you need
more information.

Cheers,
Brad

>
> root@nodeB:/mnt# ceph osd tree
> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 1.81738 root default
> -2 0.90869 host nodeB
>  0 0.90869 osd.0   up  1.0  1.0
> -3 0.90869 host nodeC
>  1 0.90869 osd.1   up  1.0  1.0
>
>
> cluster_master@nodeC:/mnt$ ceph osd dump | grep data
> pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 17 flags hashpspool stripe_width
> 0
>
>
> cluster_master@nodeC:/mnt$ cat decompiled-crush-map.txt
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host nodeB {
> id -2 # do not change unnecessarily
> # weight 0.909
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 0.909
> }
> host nodeC {
> id -3 # do not change unnecessarily
> # weight 0.909
> alg straw
> hash 0 # rjenkins1
> item osd.1 weight 0.909
> }
> root default {
> id -1 # do not change unnecessarily
> # weight 1.817
> alg straw
> hash 0 # rjenkins1
> item nodeB weight 0.909
> item nodeC weight 0.909
> }
>
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> # end crush map
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pinpointing performance bottleneck / would SSD journals help?

2016-06-27 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Daniel Schneller
> Sent: 27 June 2016 17:33
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Pinpointing performance bottleneck / would SSD
> journals help?
> 
> On 2016-06-27 16:01:07 +, Lionel Bouton said:
> 
> > On 27/06/2016 17:42, Daniel Schneller wrote:
> >> Hi!
> >>
> >> * Network Link saturation.
> >> All links / bonds are well below any relevant load (around 35MB/s or
> >> less)
> > ...
> > Are you sure? On each server you have 12 OSDs with a theoretical
> > bandwidth of at least half of 100MB/s (minimum bandwidth of any
> > reasonable HDD but halved because of the journal on the same device).
> > Which means your total disk bandwidth per server is 600MB/s.
> 
> Correct. However, I fear that because of lots of random IO going on, we
> won't be coming anywhere near that number, esp. with 3x replication.
> 
> > Bonded links are not perfect aggregation (depending on the mode one
> > client will either always use the same link or have its traffic
> > imperfectly balanced between the 2), so your theoretical network
> > bandwidth is probably nearest to 1Gbps (~ 120MB/s).
> 
> We use layer3+4 to spread traffic based on sources and destination IP and
> port information. Benchmarks have shown that using enough parallel
> streams we can saturate the full 250MB/s this ideally produces. You are
right,
> of course, that any single TCP connection will never exceed 1Gbps.
> 
> > What could happen is that the 35MB/s is an average over a large period
> > (several seconds), it's probably peaking at 120MB/s during short bursts.
> 
> That thought crossed my mind early on, too, but these values are based on
> /proc/net/dev which has counters for each network device. The statistics
are
> gathered by checking the difference between the current sample and the
> last. So this does not suffer from samples being taken at relatively long
> intervals.
> 
> > I wouldn't use less than 10Gbps for both the cluster and public
> > networks in your case.
> 
> I whole-heartedly agree... Certainly sensible, but for now we have to make
> do with the infrastructure we have. Still, based on the data we have so
> far,
> the network at least doesn't jump at me as a (major) contributor to the
> slowness we see in this current scenario.
> 
> 
> > You didn't say how many VMs are running : the rkB/s and wkB/s seem
> > very low (note that for write intensive tasks your VM is reading quite
> > a
> > bit...) but if you have 10 VMs or more battling for read and write
> > access this way it wouldn't be unexpected. As soon as latency rises
> > for one reason or another (here it would be network latency) you can
> > expect the total throughput of random accesses to plummet.
> 
> In total there are about 25 VMs, however many of them are less I/O bound
> than MongoDB and Elasticsearch.  As for the comparatively high read load,
I
> agree, but I cannot really explain that in detail at the moment.
> 
> In general I would be very much interested in diagnosing the underlying
bare
> metal layer without making too many assumptions about what clients are
> actually doing. In this case we can look into the VMs, but in general it
would
> be ideal to pinpoint a bottleneck on the "lower" levels. Any improvements
> there would be beneficial to all client software.
> 

You need to run iostat on the OSD nodes themselves and see what the disks
are doing. You stated that they are doing ~180 iops per disk, which suggests
they are highly saturated and likely to be the cause of the problem. I'm
guessing you will also see really high queue depths per disk, which normally
is the cause of high latency.
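
Something as simple as this on each OSD node will show it (the first report
since boot is cumulative, so look at the later samples; watch %util and
avgqu-sz on the OSD disks):

   iostat -x 5 3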

If you add SSD journals and a large amount of your IO is writes, then you
may see an improvement. But you may also be at the point where you just need
more disks to be able to provide the required performance.

> Cheers,
> Daniel
> 
> 
> --
> Daniel Schneller
> Principal Cloud Engineer
> 
> CenterDevice GmbH
> https://www.centerdevice.de
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd current.remove.me.somenumber

2016-06-27 Thread Gregory Farnum
On Sat, Jun 25, 2016 at 11:22 AM, Mike Miller  wrote:
> Hi,
>
> what is the meaning of the directory "current.remove.me.846930886" is
> /var/lib/ceph/osd/ceph-14?

If you're using btrfs, I believe that's a no-longer-required snapshot
of the current state of the system. If you're not, I've no idea what
creates directories named like that.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mount /etc/fstab

2016-06-27 Thread Michael Hanscho
On 2016-06-27 11:40, John Spray wrote:
> On Sun, Jun 26, 2016 at 10:51 AM, Michael Hanscho  wrote:
>> On 2016-06-26 10:30, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> On Sun, 26 Jun 2016 09:33:10 +0200 Willi Fehler wrote:
>>>
 Hello,

 I found an issue. I've added a ceph mount to my /etc/fstab. But when I
 boot my system it hangs:

 libceph: connect 192.168.0.5:6789 error -101

 After the system is booted I can successfully run mount -a.

>>>
>>> So what does that tell you?
>>> That Ceph can't connect during boot, because... there's no network yet.
>>>
>>> This is what the "_netdev" mount option is for.
>>>
>>
>> http://docs.ceph.com/docs/master/cephfs/fstab/
>>
>> No hint in the documentation - although a full page on cephfs and fstab?!
> 
> Yeah, that is kind of an oversight!  https://github.com/ceph/ceph/pull/9942

Thanks a lot!!

Gruesse
Michael
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pinpointing performance bottleneck / would SSD journals help?

2016-06-27 Thread Daniel Schneller

On 2016-06-27 16:01:07 +, Lionel Bouton said:


On 27/06/2016 17:42, Daniel Schneller wrote:

Hi!

* Network Link saturation.
All links / bonds are well below any relevant load (around 35MB/s or
less)

...
Are you sure? On each server you have 12 OSDs with a theoretical
bandwidth of at least half of 100MB/s (minimum bandwidth of any
reasonable HDD but halved because of the journal on the same device).
Which means your total disk bandwidth per server is 600MB/s.


Correct. However, I fear that because of lots of random IO going on,
we won't be coming anywhere near that number, esp. with 3x replication.


Bonded links are not perfect aggregation (depending on the mode one
client will either always use the same link or have its traffic
imperfectly balanced between the 2), so your theoretical network
bandwidth is probably nearest to 1Gbps (~ 120MB/s).


We use layer3+4 to spread traffic based on sources and destination
IP and port information. Benchmarks have shown that using enough
parallel streams we can saturate the full 250MB/s this ideally
produces. You are right, of course, that any single TCP connection
will never exceed 1Gbps.


What could happen is that the 35MB/s is an average over a large period
(several seconds), it's probably peaking at 120MB/s during short bursts.


That thought crossed my mind early on, too, but these values are based on
/proc/net/dev which has counters for each network device. The statistics
are gathered by checking the difference between the current sample and
the last. So this does not suffer from samples being taken at relatively
long intervals.


I wouldn't use less than 10Gbps for both the cluster and public networks
in your case.


I whole-heartedly agree... Certainly sensible, but for now we have to make
do with the infrastructure we have. Still, based on the data we have so far,
the network at least doesn't jump at me as a (major) contributor to the
slowness we see in this current scenario.



You didn't say how many VMs are running : the rkB/s and wkB/s seem very
low (note that for write intensive tasks your VM is reading quite a
bit...) but if you have 10 VMs or more battling for read and write
access this way it wouldn't be unexpected. As soon as latency rises for
one reason or another (here it would be network latency) you can expect
the total throughput of random accesses to plummet.


In total there are about 25 VMs, however many of them are less I/O bound
than MongoDB and Elasticsearch.  As for the comparatively high read load,
I agree, but I cannot really explain that in detail at the moment.

In general I would be very much interested in diagnosing the underlying
bare metal layer without making too many assumptions about what clients
are actually doing. In this case we can look into the VMs, but in general
it would be ideal to pinpoint a bottleneck on the "lower" levels. Any
improvements there would be beneficial to all client software.

Cheers,
Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pinpointing performance bottleneck / would SSD journals help?

2016-06-27 Thread Lionel Bouton
On 27/06/2016 17:42, Daniel Schneller wrote:
> Hi!
>
> We are currently trying to pinpoint a bottleneck and are somewhat stuck.
>
> First things first, this is the hardware setup:
>
> 4x DELL PowerEdge R510, 12x4TB OSD HDDs, journal colocated on HDD
>   96GB RAM, 2x6 Cores + HT
> 2x1GbE bonded interfaces for Cluster Network
> 2x1GbE bonded interfaces for Public Network
> Ceph Hammer on Ubuntu 14.04
>
> 6 OpenStack Compute Nodes with all-RBD VMs (no ephemeral storage).
>
> The VMs run a variety of stuff, most notable MongoDB, Elasticsearch
> and our custom software which uses both the VM's virtual disks as
> well the Rados Gateway for Object Storage.
>
> Recently, under certain more write intensive conditions we see reads
> overall system performance starting to suffer as well.
>
> Here is an iostat -x 3 sample for one of the VMs hosting MongoDB.
> Notice the "await" times (vda is the root, vdb is the data volume).
>
>
> Linux 3.13.0-35-generic (node02) 06/24/2016 _x86_64_(16 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>   1.550.000.440.420.00   97.59
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda   0.00 0.910.091.01 2.55 9.59   
> 22.12 0.01  266.90 2120.51   98.59   4.76   0.52
> vdb   0.00 1.53   18.39   40.79   405.98   483.92   
> 30.07 0.305.685.425.80   3.96  23.43
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>   5.050.002.083.160.00   89.71
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda   0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> vdb   0.00 7.00   23.00   29.00   368.00   500.00   
> 33.38 1.91  446.00  422.26  464.83  19.08  99.20
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>   4.430.001.734.940.00   88.90
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda   0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> vdb   0.0013.00   45.00   83.00   712.00  1041.00   
> 27.39 2.54 1383.25  272.18 1985.64   7.50  96.00
>
>
> If we read this right, the average time spent waiting for read or write
> requests to be serviced can be multi-second. This would go in line with
> MongoDB's slow log, where we see fully indexed queries, returning a
> single result, taking over a second, where they would normally be
> finished
> quasi instantly.
>
> So far we have looked at these metrics (using StackExchange's Bosun
> from https://bosun.org). Most values are collected every 15 seconds.
>
> * Network Link saturation.
>  All links / bonds are well below any relevant load (around 35MB/s or
>  less)

Are you sure? On each server you have 12 OSDs with a theoretical
bandwidth of at least half of 100MB/s (the minimum bandwidth of any
reasonable HDD, halved because of the journal on the same device).
Which means your total disk bandwidth per server is 600MB/s.
Bonded links are not perfect aggregation (depending on the mode, one
client will either always use the same link or have its traffic
imperfectly balanced between the 2), so your theoretical network
bandwidth is probably nearer to 1Gbps (~ 120MB/s).

What could happen is that the 35MB/s is an average over a large period
(several seconds), it's probably peaking at 120MB/s during short bursts.
I wouldn't use less than 10Gbps for both the cluster and public networks
in your case.

You didn't say how many VMs are running: the rkB/s and wkB/s seem very
low (note that for write intensive tasks your VM is reading quite a
bit...) but if you have 10 VMs or more battling for read and write
access this way it wouldn't be unexpected. As soon as latency rises for
one reason or another (here it would be network latency) you can expect
the total throughput of random accesses to plummet.

If your cluster isn't already backfilling or deep scrubbing you can
expect it to crumble on itself when it does (and it will have to perform
these at some point)...

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-27 Thread Mark Nelson

On 06/27/2016 03:12 AM, Blair Bethwaite wrote:

On 25 Jun 2016 6:02 PM, "Kyle Bader"  wrote:

fdatasync takes longer when you have more inodes in the slab caches,

it's the double edged sword of vfs_cache_pressure.

That's a bit sad when, iiuc, it's only journals doing fdatasync in the
Ceph write path. I'd have expected the vfs to handle this on a per fs
basis (and a journal filesystem would have very little in the inode cache).

It's somewhat annoying there isn't a way to favor dentries (and perhaps
dentry inodes) over other inodes in the vfs cache. Our experience shows
that it's dentry misses that cause the major performance issues (makes
sense when you consider the osd is storing all its data in the leafs of
the on disk PG structure).

This is another discussion that seems to backup the choice to implement
bluestore.


Indeed.

Mark



Cheers,
Blair



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Pinpointing performance bottleneck / would SSD journals help?

2016-06-27 Thread Daniel Schneller

Hi!

We are currently trying to pinpoint a bottleneck and are somewhat stuck.

First things first, this is the hardware setup:

4x DELL PowerEdge R510, 12x4TB OSD HDDs, journal colocated on HDD
  96GB RAM, 2x6 Cores + HT
2x1GbE bonded interfaces for Cluster Network
2x1GbE bonded interfaces for Public Network
Ceph Hammer on Ubuntu 14.04

6 OpenStack Compute Nodes with all-RBD VMs (no ephemeral storage).

The VMs run a variety of stuff, most notable MongoDB, Elasticsearch
and our custom software which uses both the VM's virtual disks as
well the Rados Gateway for Object Storage.

Recently, under certain more write-intensive conditions, we see reads and
overall system performance starting to suffer as well.

Here is an iostat -x 3 sample for one of the VMs hosting MongoDB.
Notice the "await" times (vda is the root, vdb is the data volume).


Linux 3.13.0-35-generic (node02)    06/24/2016    _x86_64_    (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.55    0.00    0.44    0.42    0.00   97.59

Device: rrqm/s wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda       0.00   0.91  0.09  1.01   2.55    9.59    22.12     0.01  266.90 2120.51   98.59   4.76   0.52
vdb       0.00   1.53 18.39 40.79 405.98  483.92    30.07     0.30    5.68    5.42    5.80   3.96  23.43

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.05    0.00    2.08    3.16    0.00   89.71

Device: rrqm/s wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda       0.00   0.00  0.00  0.00   0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vdb       0.00   7.00 23.00 29.00 368.00  500.00    33.38     1.91  446.00  422.26  464.83  19.08  99.20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.43    0.00    1.73    4.94    0.00   88.90

Device: rrqm/s wrqm/s   r/s   w/s  rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda       0.00   0.00  0.00  0.00   0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vdb       0.00  13.00 45.00 83.00 712.00 1041.00    27.39     2.54 1383.25  272.18 1985.64   7.50  96.00



If we read this right, the average time spent waiting for read or write
requests to be serviced can be multi-second. This would go in line with
MongoDB's slow log, where we see fully indexed queries, returning a
single result, taking over a second, where they would normally be finished
quasi instantly.

So far we have looked at these metrics (using StackExchange's Bosun
from https://bosun.org). Most values are collected every 15 seconds.

* Network Link saturation.
 All links / bonds are well below any relevant load (around 35MB/s or
 less)

* Storage Node RAM
 At least 3GB reported "free", between 50GB and 70GB as cached.

* Storage node CPU.
 Hardly above 30%

* # of ios in progress per OSD (as per /proc/diskstats)
 These reach values of up to 180.



Bosun collects the raw data for these metrics (and lots of others)
every 15 seconds.

We have a suspicion the spinners are the culprit here, but to verify
this and to be able to convince the upper layers of company leadership
to invest in some SSDs for journals, we need better evidence; apart
from the personal desire to understand exactly what's going on here :)

Regardless of the VMs on top (which could be any client, as I see it)
which metrics would I have to collect/look at to verify/reject the
assumption that we are limited by our pure HDD setup?
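
(The Ceph-level views I have found so far are the per-OSD commit/apply
latencies and the per-daemon counters from the admin socket, e.g.

   ceph osd perf
   ceph daemon osd.3 perf dump | grep -A 2 -E 'op_(r|w)_latency'

where osd.3 is just an example id and the grep is only a rough filter over
the JSON output. Not sure whether these are the right things to watch,
hence the question.)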


Thanks a lot!

Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Auto-Tiering

2016-06-27 Thread Rakesh Parkiti
Hi All,
Does CEPH support auto tiering?
Thanks
Rakesh Parkiti
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph not replicating to all osds

2016-06-27 Thread Ishmael Tsoaela
Hi ALL,

Anyone can help with this issue would be much appreciated.

I have created an image on one client and mounted it on both of the 2 clients I
have set up.

When I write data on one client, I cannot access the data on another
client, what could be causing this issue?

root@nodeB:/mnt# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.81738 root default
-2 0.90869 host nodeB
 0 0.90869 osd.0   up  1.0  1.0
-3 0.90869 host nodeC
 1 0.90869 osd.1   up  1.0  1.0


cluster_master@nodeC:/mnt$ ceph osd dump | grep data
pool 1 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 17 flags hashpspool
stripe_width 0


cluster_master@nodeC:/mnt$ cat decompiled-crush-map.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host nodeB {
id -2 # do not change unnecessarily
# weight 0.909
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.909
}
host nodeC {
id -3 # do not change unnecessarily
# weight 0.909
alg straw
hash 0 # rjenkins1
item osd.1 weight 0.909
}
root default {
id -1 # do not change unnecessarily
# weight 1.817
alg straw
hash 0 # rjenkins1
item nodeB weight 0.909
item nodeC weight 0.909
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Should I use different pool?

2016-06-27 Thread Kanchana. P
The Calamari URL displays the error below:

New Calamari Installation
This appears to be the first time you have started Calamari and there are
no clusters currently configured.
3 Ceph servers are connected to Calamari, but no Ceph cluster has been
created yet. Please use ceph-deploy to create a cluster; please see the
Inktank Ceph Enterprise documentation for more details.

When I execute ceph-deploy calamari connect again, the "calamari.conf" file changes to
"master: None".

My cluster has 4 nodes:
AMCNode: admin + mon + calamari
siteAosd
siteBosd
siteCosd

ceph version 10.2.2 on ubuntu 14.04
salt version  2014.7.5+ds-1ubuntu1
diamond 3.4.67_all.deb

1. Installed the ceph-deploy deb package on the admin/calamari server node

wget
http://download.ceph.com/debian-jewel/pool/main/c/ceph-deploy/ceph-deploy_1.5.34_all.deb
sudo dpkg -i ceph-deploy_1.5.34_all.deb

2. Downloaded calamari deb packages on admin/calamari server node

sudo wget
http://download.ceph.com/calamari/1.3.1/ubuntu/trusty/pool/main/c/calamari/calamari-server_1.3.1.1-1trusty_amd64.deb
sudo wget
http://download.ceph.com/calamari/1.3.1/ubuntu/trusty/pool/main/c/calamari-clients/calamari-clients_1.3.1.1-1trusty_all.deb
sudo wget
http://download.ceph.com/calamari/1.3.1/ubuntu/trusty/pool/main/d/diamond/diamond_3.4.67_all.deb


3. Installed salt version 2014.7 on the admin/calamari server node

sudo add-apt-repository ppa:saltstack/salt2014-7

4. Then ran the commands below from the calamari server / admin node

sudo apt-get update
sudo apt-get install salt-master
sudo apt-get install salt-minion
sudo apt-get install -y apache2 libapache2-mod-wsgi libcairo2
supervisor python-cairo libpq5 postgresql
sudo apt-get -f install
sudo dpkg -i calamari-server*.deb calamari-clients*.deb
sudo calamari-ctl initialize

5. Edited ceph.conf file and added

[ceph-deploy-calamari]
master = amcnode

pushed config file to all other nodes
ceph-deploy --overwrite-conf config push amcnode siteAosd siteBosd
siteCosd

6. Installed salt package on other nodes. Copied diamond package to all
other nodes and installed

sudo add-apt-repository ppa:saltstack/salt2014-7
sudo dpkg -i diamond_3.4.67_all.deb
sudo apt-get install  python-support

7. Executed the below command from calamari server node /admin node

ceph-deploy calamari connect siteAosd siteBosd siteCosd

8. The URL showed all 3 nodes and prompted to "Add" them. Adding the nodes
failed.

9. The calamari.conf file on the calamari client nodes had "master: None"; modified
it to "master: amcnode" and restarted the salt minion.

sudo vi /etc/salt/minion.d/calamari.conf
master: amcnode

sudo service salt-minion restart

10. URL still shows the below error:

New Calamari Installation
This appears to be the first time you have started Calamari and there are
no clusters currently configured.
3 Ceph servers are connected to Calamari, but no Ceph cluster has been
created yet. Please use ceph-deploy to create a cluster; please see the
Inktank Ceph Enterprise documentation for more details.

11. Executed ceph-deploy calamari connect again; now the calamari.conf file is
changed back to "master: None" again.

Thanks for your help in advance.

On Sun, Jun 26, 2016 at 2:48 PM, EM - SC 
wrote:

> Hi,
>
> I'm new to ceph and in the mailing list, so hello all!
>
> I'm testing ceph and the plan is to migrate our current 18TB storage
> (zfs/nfs) to ceph. This will be using CephFS and mounted in our backend
> application.
> We are also planning on using virtualisation (opennebula) with rbd for
> images and, if it makes sense, use rbd for our oracle server.
>
> My question is about pools.
> For what I read, I should create different pools for different HD speed
> (SAS, SSD, etc).
> - What else should I consider for creating pools?
> - should I create different pools for rbd, cephfs, etc?
>
> thanks in advanced,
> em
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Should I use different pool?

2016-06-27 Thread David
Yes you should definitely create different pools for different HDD types.
Another decision you need to make is whether you want dedicated nodes for
SSD or want to mix them in the same node. You need to ensure you have
sufficient CPU and fat enough network links to get the most out of your
SSDs.

You can add multiple data pools to Cephfs so if you can identify the hot
and cold data in your dataset you could do "manual" tiering as an
alternative to using a cache tier.
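
For the manual tiering route, the mechanics are roughly as follows (a sketch
only; pool name, pg count and paths are placeholders, the target directory
must already exist on the mounted filesystem, and only new files created
under it land in the extra pool):

    ceph osd pool create cephfs_archive 256
    ceph mds add_data_pool cephfs_archive   # 'ceph fs add_data_pool <fsname> <pool>' on newer releases
    setfattr -n ceph.dir.layout.pool -v cephfs_archive /mnt/cephfs/archive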

18TB is a relatively small capacity, have you considered an all-SSD cluster?

On Sun, Jun 26, 2016 at 10:18 AM, EM - SC 
wrote:

> Hi,
>
> I'm new to ceph and in the mailing list, so hello all!
>
> I'm testing ceph and the plan is to migrate our current 18TB storage
> (zfs/nfs) to ceph. This will be using CephFS and mounted in our backend
> application.
> We are also planning on using virtualisation (opennebula) with rbd for
> images and, if it makes sense, use rbd for our oracle server.
>
> My question is about pools.
> For what I read, I should create different pools for different HD speed
> (SAS, SSD, etc).
> - What else should I consider for creating pools?
> - should I create different pools for rbd, cephfs, etc?
>
> thanks in advanced,
> em
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel Multisite RGW Memory Issues

2016-06-27 Thread Ben Agricola
Hi Pritha,

Urgh, not sure what happened to the formatting there - let's try again.

At the time, the 'primary' cluster (i.e. the one with the active data set)
was receiving backup files from a small number of machines, prior to
replication being enabled it was using ~10% RAM on the RadosGW boxes.

Without replication enabled, neither cluster sees any spikes in memory
usage under normal operation, with a slight increase when deep scrubbing
(I'm monitoring cluster memory usage as a whole so OSD memory increases
would account for that).

Neither cluster was performing a deep scrub at the time. The 'secondary'
cluster (i.e. the one I was trying to sync data to, which now has
replication disabled again) has now had a RadosGW process running under
normal load since June 17 with replication disabled and is using 1084M RSS.
This matches with historical graphing for the primary cluster, which has
hovered around 1G RSS for RadosGW processes for the last 6 months.

I've just tested this out this morning and enabling replication caused all
RadosGW processes to increase in memory usage (and continue increasing)
from ~1000M RSS to ~20G RSS in about 2 minutes. As soon as replication is
enabled (as in, within seconds) RSS of RadosGW on both clusters starts to
increase and does not drop. This appears to happen during metadata sync as
well as during normal data syncing.

I then killed all RadosGW processes on the 'primary' side, and memory usage
of the RadosGW processes on the 'secondary' side continue to increase in
usage at the same rate. There are no further messages in the RadosGW log as
this is occurring (since there is no client traffic and no further
replication traffic). If I kill the active RadosGW processes then they
start back up and normal memory usage resumes.

Cheers,

Ben.


On Mon, 27 Jun 2016 at 10:39 Ben Agricola  wrote:

> Hi Pritha,
>
>
> At the time, the 'primary' cluster (i.e. the one with the active data set) 
> was receiving backup files from a small number of machines, prior to 
> replication being
>
> enabled it was using ~10% RAM on the RadosGW boxes.
>
>
> Without replication enabled, neither cluster sees any spikes in memory usage 
> under normal operation, with a slight increase when deep scrubbing (I'm 
> monitoring
>
> cluster memory usage as a whole so OSD memory increases would account for 
> that). Neither cluster was performing a deep scrub at the time. The 
> 'secondary' cluster
>
> (i.e. the one I was trying to sync data to, which now has replication 
> disabled again) has now had a RadosGW process running under normal load since 
> June 17
>
> with replication disabled and is using 1084M RSS. This matches with 
> historical graphing for the primary cluster, which has hovered around 1G RSS 
> for RadosGW
>
> processes for the last 6 months.
>
>
> I've just tested this out this morning and enabling replication caused all 
> RadosGW processes to increase in memory usage (and continue increasing) from 
> ~1000M RSS
>
> to ~20G RSS in about 2 minutes. As soon as replication is enabled (as in, 
> within seconds) RSS of RadosGW on both clusters starts to increase and does 
> not drop. This
>
> appears to happen during metadata sync as well as during normal data syncing 
> as well.
>
>
> I then killed all RadosGW processes on the 'primary' side, and memory usage 
> of the RadosGW processes on the 'secondary' side continue to increase in 
> usage at
>
> the same rate. There are no further messages in the RadosGW log as this is 
> occurring (since there is no client traffic and no further replication 
> traffic).
>
> If I kill the active RadosGW processes then they start back up and normal 
> memory usage resumes.
>
> Cheers,
>
> Ben.
>
>
> - Original Message -
> > From: "Pritha Srivastava"  > >
> > To: ceph-users@... 
> > 
> > Sent: Monday, June 27, 2016 07:32:23
> > Subject: Re: [ceph-users] Jewel Multisite RGW Memory Issues
>
> > Do you know if the memory usage is high only during load from clients and is
> > steady otherwise?
> > What was the nature of the workload at the time of the sync operation?
>
> > Thanks,
> > Pritha
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fsmap question

2016-06-27 Thread John Spray
On Mon, Jun 27, 2016 at 8:02 AM, Goncalo Borges
 wrote:
> Hi All ...
>
> just updated from infernalis to jewel 10.2.2 in centos7
>
> The procedure worked fine apart from the issue also reported on this thread:
> "osds udev rules not triggered on reboot (jewel, jessie)".
>
> Apart from that, I am not understanding the fsmap output provided by 'ceph
> -s' which is
>
> # ceph -s
> (...)
> fsmap e812: 1/1/1 up {0=rccephmds2=up:standby-replay}

Apologies, you are hitting a bug in the code that generates that line
(http://tracker.ceph.com/issues/15968).

It's just a glitch in the print, so under the hood you're fine (as you
have noticed by looking at mds dump).  This will get fixed in a Jewel
point release when the backport merges
(https://github.com/ceph/ceph/pull/9547)

John



>
> If you look to my mds dump, these is what I have:
>
> # ceph mds dump
> (...)
> 411196:192.231.127.32:6800/1457 'rccephmds' mds.0.805 up:active seq 152
> 440708:192.231.127.53:6800/833 'rccephmds2' mds.0.0 up:standby-replay
> seq 1
>
> so I was expecting an output by 'ceph -s' like
>
>fsmap e812: 1/1/1 up {0=rccephmds=up:active}
> 1=rccephmds2=up:standby-replay
>
> On my config I have
>
> [mds.rccephmds]
> host = rccephmds
> mds standby replay = true
>
> [mds.rccephmds2]
> host = rccephmds2
> mds standby_for_rank = rccephmds
> mds standby replay = true
>
>
> Am I doing something particularly different than what is expected?
>
> Cheers
> G.
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW  2006
> T: +61 2 93511937
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mount /etc/fstab

2016-06-27 Thread John Spray
On Sun, Jun 26, 2016 at 10:51 AM, Michael Hanscho  wrote:
> On 2016-06-26 10:30, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Sun, 26 Jun 2016 09:33:10 +0200 Willi Fehler wrote:
>>
>>> Hello,
>>>
>>> I found an issue. I've added a ceph mount to my /etc/fstab. But when I
>>> boot my system it hangs:
>>>
>>> libceph: connect 192.168.0.5:6789 error -101
>>>
>>> After the system is booted I can successfully run mount -a.
>>>
>>
>> So what does that tell you?
>> That Ceph can't connect during boot, because... there's no network yet.
>>
>> This is what the "_netdev" mount option is for.
>>
>
> http://docs.ceph.com/docs/master/cephfs/fstab/
>
> No hint in the documentation - although a full page on cephfs and fstab?!

Yeah, that is kind of an oversight!  https://github.com/ceph/ceph/pull/9942
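
Until that lands in the docs, a line along these lines in /etc/fstab should
do the trick for the kernel client (a sketch; adjust monitor address, mount
point, user and secret file to your setup). The _netdev option is what
delays the mount until the network is up:

192.168.0.5:6789:/  /mnt/cephfs  ceph  name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev  0  2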

John

> Gruesse
> Michael
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel Multisite RGW Memory Issues

2016-06-27 Thread Ben Agricola
Hi Pritha,


At the time, the 'primary' cluster (i.e. the one with the active data
set) was receiving backup files from a small number of machines, prior
to replication being

enabled it was using ~10% RAM on the RadosGW boxes.


Without replication enabled, neither cluster sees any spikes in memory
usage under normal operation, with a slight increase when deep
scrubbing (I'm monitoring

cluster memory usage as a whole so OSD memory increases would account
for that). Neither cluster was performing a deep scrub at the time.
The 'secondary' cluster

(i.e. the one I was trying to sync data to, which now has replication
disabled again) has now had a RadosGW process running under normal
load since June 17

with replication disabled and is using 1084M RSS. This matches with
historical graphing for the primary cluster, which has hovered around
1G RSS for RadosGW

processes for the last 6 months.


I've just tested this out this morning and enabling replication caused
all RadosGW processes to increase in memory usage (and continue
increasing) from ~1000M RSS

to ~20G RSS in about 2 minutes. As soon as replication is enabled (as
in, within seconds) RSS of RadosGW on both clusters starts to increase
and does not drop. This

appears to happen during metadata sync as well as during normal data
syncing as well.


I then killed all RadosGW processes on the 'primary' side, and memory
usage of the RadosGW processes on the 'secondary' side continue to
increase in usage at

the same rate. There are no further messages in the RadosGW log as
this is occurring (since there is no client traffic and no further
replication traffic).

If I kill the active RadosGW processes then they start back up and
normal memory usage resumes.

Cheers,

Ben.


- Original Message -
> From: "Pritha Srivastava"  >
> To: ceph-users@... 
> 
> Sent: Monday, June 27, 2016 07:32:23
> Subject: Re: [ceph-users] Jewel Multisite RGW Memory Issues

> Do you know if the memory usage is high only during load from clients and is
> steady otherwise?
> What was the nature of the workload at the time of the sync operation?

> Thanks,
> Pritha
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Regarding GET BUCKET ACL REST call

2016-06-27 Thread Anand Bhat
Hi,

When a GET Bucket ACL REST call is issued with X-Auth-Token set, the call fails.
This is due to the bucket in question not having CORS settings. Is there a way
to set CORS on an S3 bucket with REST APIs? I know a way using boto S3
that works; I am looking for the REST API for setting CORS.
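
For reference, this is what I have pieced together so far, assuming RGW
follows the AWS S3 'cors' subresource (signature and date headers elided
here, bucket and host names are just examples):

PUT /mybucket?cors HTTP/1.1
Host: rgw.example.com
Content-MD5: <md5-of-body>
Authorization: AWS <access-key>:<signature>

<CORSConfiguration>
  <CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
</CORSConfiguration>

Is this the expected way to do it against radosgw?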

Regards,
Anand

-- 

Never say never.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-27 Thread Blair Bethwaite
On 25 Jun 2016 6:02 PM, "Kyle Bader"  wrote:
> fdatasync takes longer when you have more inodes in the slab caches, it's
the double edged sword of vfs_cache_pressure.

That's a bit sad when, iiuc, it's only journals doing fdatasync in the Ceph
write path. I'd have expected the vfs to handle this on a per fs basis (and
a journal filesystem would have very little in the inode cache).

It's somewhat annoying there isn't a way to favor dentries (and perhaps
dentry inodes) over other inodes in the vfs cache. Our experience shows
that it's dentry misses that cause the major performance issues (makes
sense when you consider the osd is storing all its data in the leafs of the
on disk PG structure).
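
(For anyone wanting to experiment, the knob in question is the sysctl below;
values under 100 make the kernel prefer to keep dentries/inodes cached, with
exactly the fdatasync trade-off Kyle describes:

sysctl vm.vfs_cache_pressure=10

and persist it via /etc/sysctl.conf if it turns out to help.)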

This is another discussion that seems to backup the choice to implement
bluestore.

Cheers,
Blair
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg scrub and auto repair in hammer

2016-06-27 Thread Christian Balzer

Hello,

On Mon, 27 Jun 2016 09:49:54 +0200 Dan van der Ster wrote:

> On Mon, Jun 27, 2016 at 2:14 AM, Christian Balzer  wrote:
> > On Sun, 26 Jun 2016 19:48:18 +0200 Stefan Priebe wrote:
> >
> >> Hi,
> >>
> >> is there any option or chance to have auto repair of pgs in hammer?
> >>
> > Short answer:
> > No, in any version of Ceph.
> 
> Well, jewel has a new option to auto-repair a PG if the num errors
> found in deep-scrub are above a threshold:
> 
> "osd_scrub_auto_repair": "false",
> "osd_scrub_auto_repair_num_errors": "5",
> 
The only good thing about this would be that it's off by default.

Because as discussed countless times here and mentioned by Sage in the
Bluestore tech talk last week, there isn't really any automatic way to
determine what the good data is.

Looking at the names for those config options I don't see "PG" in there,
so maybe this is indeed on a per OSD level, meaning that if an OSD racks
up more than 5 (distinct) scrub errors it will get auto-repaired.
That would be based on the assumption that an OSD with so many errors is
failing in some way.
But I can see scenarios where, with low replication and OSD counts, this
may fail spectacularly.
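
For hammer the manual route is still the way to go: watch 'ceph health
detail' for inconsistent PGs, work out which replica is actually the good
one, and only then kick off

ceph pg repair <pgid>

with <pgid> being the affected placement group.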

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg scrub and auto repair in hammer

2016-06-27 Thread Dan van der Ster
On Mon, Jun 27, 2016 at 2:14 AM, Christian Balzer  wrote:
> On Sun, 26 Jun 2016 19:48:18 +0200 Stefan Priebe wrote:
>
>> Hi,
>>
>> is there any option or chance to have auto repair of pgs in hammer?
>>
> Short answer:
> No, in any version of Ceph.

Well, jewel has a new option to auto-repair a PG if the num errors
found in deep-scrub are above a threshold:

"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",

--
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel Multisite RGW Memory Issues

2016-06-27 Thread Pritha Srivastava
Corrected the formatting of the e-mail sent earlier.

- Original Message -
> From: "Pritha Srivastava" 
> To: ceph-users@lists.ceph.com
> Sent: Monday, June 27, 2016 9:15:36 AM
> Subject: Re: [ceph-users] Jewel Multisite RGW Memory Issues
> 
> 
> I have 2 distinct clusters configured, in 2 different locations, and 1
> zonegroup.
> 
> Cluster 1 has ~11TB of data currently on it, S3 / Swift backups via
> the duplicity backup tool - each file is 25Mb and probably 20% are
> multipart uploads from S3 (so 4Mb stripes) - in total 3217kobjects.
> This cluster has been running for months (without RGW replication)
> with no issue. Each site has 1 RGW instance at the moment.
> 
> I recently set up the second cluster on identical hardware in a
> secondary site. I configured a multi-site setup, with both of these
> sites in an active-active configuration. The second cluster has no
> active data set, so I would expect site 1 to start mirroring to site 2
> - and it does.
> 
> Unfortunately as soon as the RGW syncing starts to run, the resident
> memory usage of radosgw instances on both clusters balloons massively
> until the process is OOMed. This isn't a slow leak - when testing I've
> found that the radosgw processes on either side can consume up to
> 300MB/s of extra RSS per *second*, completely ooming a machine with
> 96GB of ram in approximately 20 minutes.
> 
> If I stop the radosgw processes on one cluster (i.e. breaking
> replication) then the memory usage of the radosgw processes on the
> other cluster stays at around 100-500MB and does not really increase
> over time.
> 
> Obviously this makes multi-site replication completely unusable so
> wondering if anyone has a fix or workaround. I noticed some pull
> requests have been merged into the master branch for RGW memory leak
> fixes so I switched to v10.2.0-2453-g94fac96 from autobuild packages,
> it seems like this slows the memory increase slightly but not enough
> to make replication usable yet.
> 
> I've tried valgrinding the radosgw process but doesn't come up with
> anything obviously leaking (I could be doing it wrong), but an example
> of the memory ballooning is captured by collectd:
> http://i.imgur.com/jePYnwz.png - this memory usage is *all* on the
> radosgw process RSS.
> 
> Anyone else seen this?

Do you know if the memory usage is high only during load from clients and is
steady otherwise?
What was the nature of the workload at the time of the sync operation?

Thanks,
Pritha
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] image map failed

2016-06-27 Thread Ishmael Tsoaela
Hi Rakesh,

That works as well. I also disabled the other features.

rbd feature disable data/data_01 exclusive-lock
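
We also added this to ceph.conf on the client so that newly created images
only get the layering feature in the first place (an assumption based on our
Jewel setup, where 1 = layering only; check before copying):

[client]
rbd default features = 1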


Thanks for the response

On Fri, Jun 24, 2016 at 6:22 AM, Rakesh Parkiti 
wrote:

> Hi Ishmael
>
> Once try to create image with image-feature as layering only.
>
> # rbd create --image pool-name/image-name --size 15G --image-feature
> layering
> # rbd map --image pool-name/image-name
>
> Thanks
> Rakesh Parkiti
> On Jun 23, 2016 19:46, Ishmael Tsoaela  wrote:
>
> Hi All,
>
> I  have created an image but cannot map the image, anybody know what could
> be the problem:
>
>
>
> sudo rbd map data/data_01
>
> rbd: sysfs write failed
> RBD image feature set mismatch. You can disable features unsupported by
> the kernel with "rbd feature disable".
> In some cases useful info is found in syslog - try "dmesg | tail" or so.
> rbd: map failed: (6) No such device or address
>
>
>
> cluster_master@nodeC:~$ dmesg |tail
> [89572.831725] libceph: client4227 fsid
> 70cc6b75-9f83-4c67-a1c4-4fe846b4849e
> [89572.832413] libceph: mon0 155.232.195.4:6789 session established
> [89573.042375] libceph: client4229 fsid
> 70cc6b75-9f83-4c67-a1c4-4fe846b4849e
> [89573.043046] libceph: mon0 155.232.195.4:6789 session established
>
>
>
> command to create image:
>
> rbd create data_01 --size 102400 --pool data
>
>
> cluster_master@nodeC:~$ rbd ls data
> data_01
>
>
> cluster_master@nodeC:~$ rbd --image data_01 -p data info
> rbd image 'data_01':
> size 102400 MB in 25600 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.105f2ae8944a
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> flags:
>
>
> cluster_master@nodeC:~$ ceph status
> cluster 70cc6b75-9f83-4c67-a1c4-4fe846b4849e
>  health HEALTH_OK
>  monmap e1: 1 mons at {nodeB=155.232.195.4:6789/0}
> election epoch 3, quorum 0 nodeB
>  osdmap e17: 2 osds: 2 up, 2 in
> flags sortbitwise
>   pgmap v160: 192 pgs, 2 pools, 6454 bytes data, 5 objects
> 10311 MB used, 1851 GB / 1861 GB avail
>  192 active+clean
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] fsmap question

2016-06-27 Thread Goncalo Borges

Hi All ...

just updated from infernalis to jewel 10.2.2 in centos7

The procedure worked fine apart from the issue also reported on this 
thread: "osds udev rules not triggered on reboot (jewel, jessie)".


Apart from that, I do not understand the fsmap output provided by 
'ceph -s', which is:


# ceph -s
   (...)
   fsmap e812: 1/1/1 up {0=rccephmds2=up:standby-replay}

If you look to my mds dump, these is what I have:

   # ceph mds dump
   (...)
   411196:192.231.127.32:6800/1457 'rccephmds' mds.0.805 up:active
   seq 152
   440708:192.231.127.53:6800/833 'rccephmds2' mds.0.0
   up:standby-replay seq 1

so I was expecting an output by 'ceph -s' like

   fsmap e812: 1/1/1 up {0=rccephmds=up:active} 
1=rccephmds2=up:standby-replay


On my config I have

   [mds.rccephmds]
   host = rccephmds
   mds standby replay = true

   [mds.rccephmds2]
   host = rccephmds2
   mds standby_for_rank = rccephmds
   mds standby replay = true


Am I doing something particularly different than what is expected?

Cheers
G.

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com