Re: [ceph-users] CephFS root squash?

2017-02-10 Thread Jim Kilborn
Interesting. I thought cephfs could be a replacement for an nfs server for 
holding home directories, but without a single point of failure. I'm surprised 
that this is generally frowned upon in the comments.



Sent from my Windows 10 phone



From: John Spray<mailto:jsp...@redhat.com>
Sent: Friday, February 10, 2017 4:21 AM
To: Robert Sander<mailto:r.san...@heinlein-support.de>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] CephFS root squash?



On Fri, Feb 10, 2017 at 8:02 AM, Robert Sander
<r.san...@heinlein-support.de> wrote:
> On 09.02.2017 20:11, Jim Kilborn wrote:
>
>> I am trying to figure out how to allow my users to have sudo on their 
>> workstation, but not have that root access to the ceph kernel mounted volume.
>
> I do not think that CephFS is meant to be mounted on human users'
> workstations.

We'd all like to avoid squishy human users if possible but sometimes
it's unavoidable :-D

My feeling is that cephfs should be mounted natively only on trusted,
"tightly coupled" systems whose availability is comparable to that of
the servers.  So a typical user laptop would be a bad idea, but a big
visualization workstation might be OK, and the always-on identical
desktops in a single CAD/CGI/EDA team might be okay too.

Slow/naughty clients generally only cause pain to other clients in the
same filesystem, so if you do have some files accessible to
workstations it might also be prudent to segregate them in a separate
filesystem (currently no cephX way of enforcing that, but if you
basically trust the workstations and just want to isolate them in case
of bugs/outages, it's okay).

John
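
Not a root squash, but the closest knob in Jewel for limiting what a
workstation key can reach is an MDS path restriction in the client's caps.
A sketch only (client name, path and pool name are illustrative, and root on
the client can still do anything inside the path it is granted):

ceph auth get-or-create client.wkstn01 \
    mon 'allow r' \
    mds 'allow rw path=/workstations' \
    osd 'allow rw pool=cephfs-data'

mount -t ceph mon1:6789:/workstations /mnt/cephfs \
    -o name=wkstn01,secretfile=/etc/ceph/wkstn01.secret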

>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> http://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein -- Sitz: Berlin
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS root squash?

2017-02-09 Thread Jim Kilborn
Does cephfs have an option for root squash, like nfs mounts do?
I am trying to figure out how to allow my users to have sudo on their 
workstation, but not have that root access to the ceph kernel mounted volume.

Can’t seem to find anything. Using cephx for the mount, but can’t find a “root 
squash” type option for mount
sudo still allows them to nuke the whole filesystem from the client.

Sent from Mail for Windows 10

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4

2017-02-09 Thread Jim Kilborn
Graham,

I don’t think this is the issue I’m seeing. I’m running Centos on kernel 
4.4.24-1. My processes aren’t dying.



I have two clusters with 3 mons in each cluster. Over the last 3 months that 
the clusters have been running, this has only happened on two nodes, and only 
once per node.



If I check the other nodes (or any nodes at this point), I see zero swap used, 
as in the example below.



[jkilborn@darkjedi-ceph02 ~]$ free -h

              total        used        free      shared  buff/cache   available

Mem:           125G         10G         85G        129M         28G        108G

Swap:          2.0G          0B        2.0G





These mon nodes are also running 8 osds each with ssd journals.

We have very little load at this point. Even when the ceph-mon process eats all 
the swap, it still shows free memory, and never goes offline.



              total        used        free      shared  buff/cache   available

Mem:      131783876    67618000    13383516       53868    50782360    61599096

Swap:       2097148     2097092          56



Seems like a ceph-mon bug/leak to me.
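
If it happens again, one thing that may help narrow it down is asking the mon
for its heap statistics before restarting it (a sketch; this assumes the mon
was built with tcmalloc, which is the default for the upstream packages):

ceph tell mon.darkjedi-ceph02 heap stats      # dump allocator statistics
ceph tell mon.darkjedi-ceph02 heap release    # ask tcmalloc to return freed pages to the OS

If "heap release" gives most of the memory back, it looks more like unreturned
allocator freelist than a true leak.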





Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Graham Allan<mailto:g...@umn.edu>
Sent: Thursday, February 9, 2017 11:24 AM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4



I've been trying to figure out the same thing recently - I had the same
issues as others with jewel 10.2.3 (?) but for my current problem I
don't think it's a ceph issue.

Specifically, ever since our last maintenance day, some of our OSD nodes
have been suffering OSDs killed by the OOM killer despite having enough
memory.

I looked for ages at the discussions about reducing the map cache size
but it just didn't seem a likely cause.

It looks like a kernel bug. Here for ubuntu:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842

I was seeing this OOM issue on kernels 4.4.0.59 and 4.4.0.62. It sounds
like downgrading to 4.4.0.57 should resolve the issue, and 4.4.0.63,
out shortly, should also fix it.

Our unaffected machines in the cluster are running a different release
and kernel (though same version of ceph).

Haven't actually tested this yet, just found the reference in the last
hour... could this also be the problem you are seeing?

Graham

On 2/8/2017 6:58 PM, Andrei Mikhailovsky wrote:
> +1
>
> Ever since upgrading to 10.2.x I have been seeing a lot of issues with our 
> ceph cluster. I have been seeing osds down, osd servers running out of memory 
> and killing all ceph-osd processes. Again, 10.2.5 on 4.4.x kernel.
>
> It seems what with every release there are more and more problems with ceph 
> (((, which is a shame.
>
> Andrei
>
> - Original Message -
>> From: "Jim Kilborn" <j...@kilborns.com>
>> To: "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Wednesday, 8 February, 2017 19:45:58
>> Subject: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel  4.4
>
>> I have had two ceph monitor nodes generate swap space alerts this week.
>> Looking at the memory, I see ceph-mon using a lot of memory and most of the 
>> swap
>> space. My ceph nodes have 128GB mem, with 2GB swap  (I know the memory/swap
>> ratio is odd)
>>
>> When I get the alert, I see the following
>>
>>
>> root@empire-ceph02 ~]# free
>>
>>               total        used        free      shared  buff/cache   available
>>
>> Mem:      131783876    67618000    13383516       53868    50782360    61599096
>>
>> Swap:       2097148     2097092          56
>>
>>
>>
>> root@empire-ceph02 ~]# ps -aux | egrep 'ceph-mon|MEM'
>>
>> USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>>
>> ceph 174239  0.3 45.8 62812848 60405112 ?   Ssl   2016 269:08
>> /usr/bin/ceph-mon -f --cluster ceph --id empire-ceph02 --setuser ceph
>> --setgroup ceph
>>
>>
>> In the ceph-mon log, I see the following:
>>
>> Feb  8 09:31:21 empire-ceph02 ceph-mon: 2017-02-08 09:31:21.211268 
>> 7f414d974700
>> -1 lsb_release_parse - failed to call lsb_release binary with error: (12)
>> Cannot allocate memory
>> Feb  8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012856 
>> 7f3dcfe94700
>> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back
>> 2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
>> (cutoff 2017-02-08 09:31:04.012854)
>> Feb  8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012900 
>> 7f3dcfe94700
>> -1 osd.8 344 heartbeat_check: no reply from 0x563e4214da10 osd.3 since back
>> 2017-02-08 09

Re: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4

2017-02-09 Thread Jim Kilborn
Joao,

Here is the information requested. Thanks for taking a look. Note that the 
below is after I restarted the ceph-mon processes yesterday. If this is not 
acceptable, I will have to wait until the issue reappears. This is on a small 
cluster. 4 ceph nodes, and 6 ceph kernel clients running over infiniband.



[root@empire-ceph02 log]# ceph -s

cluster 62ed97d6-adf4-12e4-8fd5-3d9701b22b87

 health HEALTH_OK

 monmap e3: 3 mons at 
{empire-ceph01=192.168.20.241:6789/0,empire-ceph02=192.168.20.242:6789/0,empire-ceph03=192.168.20.243:6789/0}

election epoch 56, quorum 0,1,2 
empire-ceph01,empire-ceph02,empire-ceph03

  fsmap e526: 1/1/1 up {0=empire-ceph03=up:active}, 1 up:standby

 osdmap e361: 32 osds: 32 up, 32 in

flags sortbitwise,require_jewel_osds

  pgmap v2427955: 768 pgs, 2 pools, 2370 GB data, 1759 kobjects

7133 GB used, 109 TB / 116 TB avail

 768 active+clean

  client io 256 B/s wr, 0 op/s rd, 0 op/s wr



[root@empire-ceph02 log]# ceph daemon mon.empire-ceph02 ops

{

"ops": [],

"num_ops": 0

}



[root@empire-ceph02 mon]# du -sh ceph-empire-ceph02

30M ceph-empire-ceph02



[root@empire-ceph02 mon]# ls -lR

.:

total 0

drwxr-xr-x. 3 ceph ceph 46 Dec  6 14:26 ceph-empire-ceph02



./ceph-empire-ceph02:

total 8

-rw-r--r--. 1 ceph ceph0 Dec  6 14:26 done

-rw---. 1 ceph ceph   77 Dec  6 14:26 keyring

drwxr-xr-x. 2 ceph ceph 4096 Feb  9 06:58 store.db



./ceph-empire-ceph02/store.db:

total 30056

-rw-r--r--. 1 ceph ceph  396167 Feb  9 06:06 510929.sst

-rw-r--r--. 1 ceph ceph  778898 Feb  9 06:56 511298.sst

-rw-r--r--. 1 ceph ceph 5177344 Feb  9 07:01 511301.log

-rw-r--r--. 1 ceph ceph 1491740 Feb  9 06:58 511305.sst

-rw-r--r--. 1 ceph ceph 2162405 Feb  9 06:58 511306.sst

-rw-r--r--. 1 ceph ceph 2162047 Feb  9 06:58 511307.sst

-rw-r--r--. 1 ceph ceph 2104201 Feb  9 06:58 511308.sst

-rw-r--r--. 1 ceph ceph 2146113 Feb  9 06:58 511309.sst

-rw-r--r--. 1 ceph ceph 2123659 Feb  9 06:58 511310.sst

-rw-r--r--. 1 ceph ceph 2162927 Feb  9 06:58 511311.sst

-rw-r--r--. 1 ceph ceph 2129640 Feb  9 06:58 511312.sst

-rw-r--r--. 1 ceph ceph 2133590 Feb  9 06:58 511313.sst

-rw-r--r--. 1 ceph ceph 2143906 Feb  9 06:58 511314.sst

-rw-r--r--. 1 ceph ceph 2158434 Feb  9 06:58 511315.sst

-rw-r--r--. 1 ceph ceph 1649589 Feb  9 06:58 511316.sst

-rw-r--r--. 1 ceph ceph  16 Feb  8 13:42 CURRENT

-rw-r--r--. 1 ceph ceph   0 Dec  6 14:26 LOCK

-rw-r--r--. 1 ceph ceph  983040 Feb  9 06:58 MANIFEST-503363





Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Joao Eduardo Luis<mailto:j...@suse.de>
Sent: Thursday, February 9, 2017 3:06 AM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4



Hi Jim,

On 02/08/2017 07:45 PM, Jim Kilborn wrote:
> I have had two ceph monitor nodes generate swap space alerts this week.
> Looking at the memory, I see ceph-mon using a lot of memory and most of the 
> swap space. My ceph nodes have 128GB mem, with 2GB swap  (I know the 
> memory/swap ratio is odd)
>
> When I get the alert, I see the following
[snip]
> root@empire-ceph02 ~]# ps -aux | egrep 'ceph-mon|MEM'
>
> USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>
> ceph 174239  0.3 45.8 62812848 60405112 ?   Ssl   2016 269:08 
> /usr/bin/ceph-mon -f --cluster ceph --id empire-ceph02 --setuser ceph 
> --setgroup ceph
>
> [snip]
>
>
> Is this a setting issue? Or Maybe a bug?
> When I look at the other ceph-mon processes on other nodes, they aren’t using 
> any swap, and only about 500MB of memory.

Can you get us the result of `ceph -s`, of `ceph daemon mon.ID ops`, and
the size of your monitor's data directory? The latter, ideally,
recursive with the sizes of all the children in the tree (which,
assuming they're a lot, would likely be better on a pastebin).

   -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4

2017-02-08 Thread Jim Kilborn
I have had two ceph monitor nodes generate swap space alerts this week.
Looking at the memory, I see ceph-mon using a lot of memory and most of the 
swap space. My ceph nodes have 128GB mem, with 2GB swap  (I know the 
memory/swap ratio is odd)

When I get the alert, I see the following


root@empire-ceph02 ~]# free

              total        used        free      shared  buff/cache   available

Mem:      131783876    67618000    13383516       53868    50782360    61599096

Swap:       2097148     2097092          56



root@empire-ceph02 ~]# ps -aux | egrep 'ceph-mon|MEM'

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND

ceph 174239  0.3 45.8 62812848 60405112 ?   Ssl   2016 269:08 
/usr/bin/ceph-mon -f --cluster ceph --id empire-ceph02 --setuser ceph 
--setgroup ceph


In the ceph-mon log, I see the following:

Feb  8 09:31:21 empire-ceph02 ceph-mon: 2017-02-08 09:31:21.211268 7f414d974700 
-1 lsb_release_parse - failed to call lsb_release binary with error: (12) 
Cannot allocate memory
Feb  8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012856 7f3dcfe94700 
-1 osd.8 344 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back 
2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb  8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012900 7f3dcfe94700 
-1 osd.8 344 heartbeat_check: no reply from 0x563e4214da10 osd.3 since back 
2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb  8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012915 7f3dcfe94700 
-1 osd.8 344 heartbeat_check: no reply from 0x563e4214d410 osd.5 since back 
2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb  8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012927 7f3dcfe94700 
-1 osd.8 344 heartbeat_check: no reply from 0x563e4214e490 osd.6 since back 
2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb  8 09:31:24 empire-ceph02 ceph-osd: 2017-02-08 09:31:24.012934 7f3dcfe94700 
-1 osd.8 344 heartbeat_check: no reply from 0x563e42149a10 osd.7 since back 
2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:04.012854)
Feb  8 09:31:25 empire-ceph02 ceph-osd: 2017-02-08 09:31:25.013038 7f3dcfe94700 
-1 osd.8 345 heartbeat_check: no reply from 0x563e4214f090 osd.1 since back 
2017-02-08 09:31:03.778901 front 2017-02-08 09:31:03.778901
(cutoff 2017-02-08 09:31:05.013020)


Is this a setting issue? Or maybe a bug?
When I look at the other ceph-mon processes on other nodes, they aren't using 
any swap, and only about 500MB of memory.

When I restart ceph-mon on the server that shows the issue, the swap frees up, 
and the memory for the new ceph-mon process is 500MB again.

Any ideas would be appreciated.


Sent from Mail for Windows 10

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client.admin accidently removed caps/permissions

2017-01-04 Thread Jim Kilborn
Disregard, you can fix this by using the monitor id and keyring file:



cd /var/lib/ceph/mon/monname

ceph -n mon. --keyring keyring auth caps client.admin mds 'allow *' osd 'allow *' mon 'allow *'
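
Once that has run, a quick sanity check with the normal admin key again
(a sketch, nothing special about the commands):

ceph auth get client.admin     # should list mds/mon/osd caps of 'allow *'
ceph -s                        # confirms client.admin can talk to the mons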





Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Jim Kilborn<mailto:j...@kilborns.com>
Sent: Wednesday, January 4, 2017 9:19 AM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] client.admin accidently removed caps/permissions



Hello:

I was trying to fix a problem with mds caps, and caused my admin user to have 
no mon caps.
I ran:

ceph auth caps client.admin mds 'allow *'

I didn’t realize I had to pass the mon and osd caps as well. Now, when I try to 
run any command, I get

2017-01-04 08:58:44.009250 7f5441f62700  0 librados: client.admin 
authentication error (13) Permission denied
Error connecting to cluster: PermissionDeniedError

What is the simplest way to get my client.admin caps/permissions fixed?


Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] client.admin accidently removed caps/permissions

2017-01-04 Thread Jim Kilborn
Hello:

I was trying to fix a problem with mds caps, and caused my admin user to have 
no mon caps.
I ran:

ceph auth caps client.admin mds 'allow *'

I didn’t realize I had to pass the mon and osd caps as well. Now, when I try to 
run any command, I get

2017-01-04 08:58:44.009250 7f5441f62700  0 librados: client.admin 
authentication error (13) Permission denied
Error connecting to cluster: PermissionDeniedError

What is the simplest way to get my client.admin caps/permissions fixed?


Sent from Mail for Windows 10

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-21 Thread Jim Kilborn
Reed/Christian,

So if I put the OSD journals on an SSD that has power loss protection (Samsung 
SM863), all the writes then go through those journals. Can I then leave write 
caching turned on for the spinner OSDs, even without a BBU caching controller? 
In the event of a power outage past our ups time, I want to ensure all the osds 
aren't corrupt after bringing the nodes back up.

Secondly, Seagate 8TB enterprise drives say they employ power loss protection 
as well. Apparently, in your case, this turned out to be untrue?
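
For reference, the on-disk cache being discussed here is the volatile cache
that hdparm toggles; a quick way to check and set it (device name is just an
example):

hdparm -W /dev/sdb       # query the current write-cache setting
hdparm -W 0 /dev/sdb     # disable the volatile on-disk cache
hdparm -W 1 /dev/sdb     # re-enable it

SAS drives behind a RAID controller may need the controller's own tools (or
sdparm) instead of hdparm.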




Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

From: Reed Dier<mailto:reed.d...@focusvq.com>
Sent: Friday, October 21, 2016 10:06 AM
To: Christian Balzer<mailto:ch...@gol.com>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size


On Oct 19, 2016, at 7:54 PM, Christian Balzer 
<ch...@gol.com<mailto:ch...@gol.com>> wrote:


Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

I have setup a new linux cluster to allow migration from our old SAN based 
cluster to a new cluster with ceph.
All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

I am basically running stock ceph settings, with just turning the write cache 
off via hdparm on the drives, and temporarily turning off scrubbing.

The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

I wanted to comment on this small bolded bit, in the early days of my ceph 
cluster, testing resiliency to power failure (worst case scenario), when the 
on-disk write cache was enabled on my drives, I would lose that OSD to leveldb 
corruption, even with BBU.

With BBU + no disk-level cache, the OSD would come back, with no data loss, 
however performance would be significantly degraded. (xfsaild process with 99% 
iowait, cured by zapping disk and recreating OSD)

For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, with 
the OSD set as a single RAID0 VD. On disk journaling.

There was a decent enough hit to write performance after disabling write 
caching at the disk layer, but write-back caching at the controller layer 
provided enough of a negating increase, that the data security was an 
acceptable trade off.

Was a tough way to learn how important this was after data center was struck by 
lightning two weeks after initial ceph cluster install and one phase of power 
was knocked out for 15 minutes, taking half the non-dual-PSU nodes with it.

Just want to make sure that people learn from that painful experience.

Reed


The latter I venture you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
Server performance should be good.
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with all SSD setup and things requiring the lowest latency
possible.


Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So 
the idea is to ensure a single host failure.
Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

And as such a replication of 2 is also ill advised, I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.

My crush map is setup to ensure the cache pool uses only the 4 850 pro and the 
erasure code uses only the 16 spinning 4TB drives.

The problems that I am seeing is that I start copying data from our old san to 
the ceph volume, and once the cache tier gets to my  target_max_bytes of 1.4 
TB, I start seeing:

HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
26 ops are blocked > 65.536 sec on osd.0
37 ops are blocked > 32.768 sec on osd.0
1 osds have slow requests
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set

osd.0 is the cache ssd

If I watch iostat on the cache ssd, I see the queue lengt

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-20 Thread Jim Kilborn
The chart obviously didn’t go well. Here it is again



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test          Local disk              SAN/NFS                  Ceph size=3/SSD journal

4M Writes         53 MB/sec    12 IOPS    62 MB/sec    15 IOPS     151 MB/sec    37 IOPS
4M Rand Writes    34 MB/sec     8 IOPS    63 MB/sec    15 IOPS     155 MB/sec    37 IOPS
4M Read           66 MB/sec    15 IOPS    102 MB/sec   25 IOPS     662 MB/sec   161 IOPS
4M Rand Read      73 MB/sec    17 IOPS    103 MB/sec   25 IOPS     670 MB/sec   163 IOPS
4K Writes         2.9 MB/sec  738 IOPS    3.8 MB/sec  952 IOPS     2.3 MB/sec   571 IOPS
4K Rand Writes    551 KB/sec  134 IOPS    3.6 MB/sec  911 IOPS     2.0 MB/sec   501 IOPS
4K Read           28 MB/sec  7001 IOPS    8 MB/sec   1945 IOPS     13 MB/sec   3256 IOPS
4K Rand Read      263 KB/sec              5 MB/sec   1246 IOPS     8 MB/sec    2015 IOPS



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Jim Kilborn<mailto:j...@kilborns.com>
Sent: Thursday, October 20, 2016 10:20 AM
To: Christian Balzer<mailto:ch...@gol.com>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Thanks Christian for the additional information and comments.



· Upgraded the kernels, but still had poor performance

· Removed all the pools and recreated with just a replication of 3, with the 
two pools for data and metadata. No cache tier pool

· Turned back on the write caching with hdparm. We do have a large UPS and dual 
power supplies in the ceph unit. If we get a long power outage, everything will 
go down anyway.



I am no longer seeing the issue of the slow requests, ops blocked, etc.



I think I will push for the following design per ceph server



8  4TB sata drives

2 Samsung 128GB SM863 SSD each holding 4 osd journals



With 4 hosts, and a replication of 3 to start with
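
A rough sketch of that simplified layout, assuming a fresh filesystem on plain
replicated pools with no cache tier (pool names and PG counts are illustrative):

ceph osd pool create cephfs-metadata 512 512 replicated
ceph osd pool create cephfs-data 1024 1024 replicated
ceph osd pool set cephfs-metadata size 3
ceph osd pool set cephfs-data size 3
ceph fs new cephfs cephfs-metadata cephfs-data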



I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD holding 
the  4 osd journals, with 4 hosts in the cluster over infiniband.



At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and 
5.5 Gb/sec over infiniband, which is around 600 MB/sec and translates well to 
the FIO number.



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test          Local disk              SAN/NFS                  Ceph w/Repl/SSD journal

4M Writes         53 MB/sec    12 IOPS    62 MB/sec    15 IOPS     151 MB/sec    37 IOPS
4M Rand Writes    34 MB/sec     8 IOPS    63 MB/sec    15 IOPS     155 MB/sec    37 IOPS
4M Read           66 MB/sec    15 IOPS    102 MB/sec   25 IOPS     662 MB/sec   161 IOPS
4M Rand Read      73 MB/sec    17 IOPS    103 MB/sec   25 IOPS     670 MB/sec   163 IOPS
4K Writes         2.9 MB/sec  738 IOPS    3.8 MB/sec  952 IOPS     2.3 MB/sec   571 IOPS
4K Rand Writes    551 KB/sec  134 IOPS    3.6 MB/sec  911 IOPS     2.0 MB/sec   501 IOPS
4K Read           28 MB/sec  7001 IOPS    8 MB/sec   1945 IOPS     13 MB/sec   3256 IOPS
4K Rand Read      263 KB/sec              5 MB/sec   1246 IOPS     8 MB/sec    2015 IOPS




That performance is fine for our needs

Again, thanks for the help guys.



Regards,

Jim



From: Christian Balzer<mailto:ch...@gol.com>
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Cc: Jim Kilborn<mailto:j...@kilborns.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning of scrubbing.
>
The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

The latter I venture you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.
Memory is fine, CPU I can't t

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-20 Thread Jim Kilborn
Thanks Christian for the additional information and comments.



· Upgraded the kernels, but still had poor performance

· Removed all the pools and recreated with just a replication of 3, with the 
two pools for data and metadata. No cache tier pool

· Turned back on the write caching with hdparm. We do have a large UPS and dual 
power supplies in the ceph unit. If we get a long power outage, everything will 
go down anyway.



I am no longer seeing the issue of the slow requests, ops blocked, etc.



I think I will push for the following design per ceph server



8  4TB sata drives

2 Samsung 128GB SM863 SSD each holding 4 osd journals



With 4 hosts, and a replication of 3 to start with



I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD holding 
the  4 osd journals, with 4 hosts in the cluster over infiniband.



At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and 
5.5 Gb/sec over infiniband, which is around 600 MB/sec and translates well to 
the FIO number.



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test          Local disk              SAN/NFS                  Ceph w/Repl/SSD journal

4M Writes         53 MB/sec    12 IOPS    62 MB/sec    15 IOPS     151 MB/sec    37 IOPS
4M Rand Writes    34 MB/sec     8 IOPS    63 MB/sec    15 IOPS     155 MB/sec    37 IOPS
4M Read           66 MB/sec    15 IOPS    102 MB/sec   25 IOPS     662 MB/sec   161 IOPS
4M Rand Read      73 MB/sec    17 IOPS    103 MB/sec   25 IOPS     670 MB/sec   163 IOPS
4K Writes         2.9 MB/sec  738 IOPS    3.8 MB/sec  952 IOPS     2.3 MB/sec   571 IOPS
4K Rand Writes    551 KB/sec  134 IOPS    3.6 MB/sec  911 IOPS     2.0 MB/sec   501 IOPS
4K Read           28 MB/sec  7001 IOPS    8 MB/sec   1945 IOPS     13 MB/sec   3256 IOPS
4K Rand Read      263 KB/sec              5 MB/sec   1246 IOPS     8 MB/sec    2015 IOPS




That performance is fine for our needs

Again, thanks for the help guys.



Regards,

Jim



From: Christian Balzer<mailto:ch...@gol.com>
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Cc: Jim Kilborn<mailto:j...@kilborns.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning of scrubbing.
>
The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

The latter I venture you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with all SSD setup and things requiring the lowest latency
possible.


> Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

And as such a replication of 2 is also ill advised, I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scr

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
John,



Updating to the latest mainline kernel from elrepo (4.8.2-1) on all 4 ceph 
servers, and the ceph client that I am testing with, still didn’t fix the 
issues.

Still getting "Failing to respond to cache pressure", and blocked ops are 
currently hovering between 100-300 requests > 32 sec.
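
Roughly, the places that show which client sessions the MDS is flagging
(a sketch; the daemon id must match the local admin socket):

ceph health detail                  # names the client ids behind the cache pressure warning
ceph daemon mds.<id> session ls     # per-client session info, including caps held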



This is just from doing an rsync from one ceph client (reading data from an old 
san over nfs, and writing to the ceph cluster over infiniband).



I guess I’ll try getting rid of the EC pool and the cache tier, and just using 
replication with size 3 and see if it works better



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: John Spray<mailto:jsp...@redhat.com>
Sent: Wednesday, October 19, 2016 12:16 PM
To: Jim Kilborn<mailto:j...@kilborns.com>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



On Wed, Oct 19, 2016 at 5:17 PM, Jim Kilborn <j...@kilborns.com> wrote:
> John,
>
>
>
> Thanks for the tips….
>
> Unfortunately, I was looking at this page 
> http://docs.ceph.com/docs/jewel/start/os-recommendations/

OK, thanks - I've pushed an update to clarify that
(https://github.com/ceph/ceph/pull/11564).

> I’ll consider either upgrading the kernels or using the fuse client, but will 
> likely go the kernel 4.4 route
>
>
>
> As for moving to just a replicated pool, I take it that a replication size of 
> 3 is minimum recommended.
>
> If I move to no EC, I will have to have have 9 4TB spinners on of the 4 
> servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
> or is that two many osds per journal, creating a hot spot for writes?

That sounds like a lot of journals on one SSD, but people other than
me have more empirical experience in hardware selection.

John

>
>
>
> Thanks!!
>
>
>
>
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
>
>
> From: John Spray<mailto:jsp...@redhat.com>
> Sent: Wednesday, October 19, 2016 9:10 AM
> To: Jim Kilborn<mailto:j...@kilborns.com>
> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
>
>
>
> On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn <j...@kilborns.com> wrote:
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning of scrubbing.
>>
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  Since I am running cephfs, I have 
>> tiering setup.
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>>
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>>
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>>
>> osd.0 is the cache ssd
>>
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>>
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb
>>   0.00 0.339.00   84.33 0.9620.11   462.40   
>>  75.92  397.56  125.67  426.58  10.70  99.90
>>   0.00 0.67   30.00   87.

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
John,



Thanks for the tips….

Unfortunately, I was looking at this page 
http://docs.ceph.com/docs/jewel/start/os-recommendations/

I’ll consider either upgrading the kernels or using the fuse client, but will 
likely go the kernel 4.4 route



As for moving to just a replicated pool, I take it that a replication size of 3 
is the minimum recommended.

If I move to no EC, I will have to have 9 4TB spinners on each of the 4 
servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
or is that too many osds per journal, creating a hot spot for writes?
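
For reference, the journal partition size is driven by the osd journal size
setting at OSD creation time; a 10GB-per-journal layout would look something
like this in ceph.conf (the value is in MB):

[osd]
osd journal size = 10240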



Thanks!!







Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: John Spray<mailto:jsp...@redhat.com>
Sent: Wednesday, October 19, 2016 9:10 AM
To: Jim Kilborn<mailto:j...@kilborns.com>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn <j...@kilborns.com> wrote:
> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning of scrubbing.
>
> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.  Since I am running cephfs, I have tiering 
> setup.
> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2
> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection
> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
> await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The 
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> sdb
>   0.00 0.339.00   84.33 0.9620.11   462.40
> 75.92  397.56  125.67  426.58  10.70  99.90
>   0.00 0.67   30.00   87.33 5.9621.03   471.20
> 67.86  910.95   87.00 1193.99   8.27  97.07
>   0.0016.67   33.00  289.33 4.2118.80   146.20
> 29.83   88.99   93.91   88.43   3.10  99.83
>   0.00 7.337.67  261.67 1.9219.63   163.81   
> 117.42  331.97  182.04  336.36   3.71 100.00
>
>
> If I look at the iostat for all the drives, only the cache ssd drive is 
> backed up
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> Sdg (journal for cache drive)
>   0.00 6.330.008.00 0.00 0.0719.04
>  0.000.330.000.33   0.33   0.27
> Sdb (cache drive)
>   0.00 0.333.33   82.00 0.8320.07   501.68   
> 106.75 1057.81  269.40 1089.86  11.72 100.00
> Sda (4TB EC)
>   0.00 0.000.004.00 0.00 0.02 9.33
>  0.000.000.000.00   0.00   0.00
> Sdd (4TB EC)
>   0.00 0.000.002.33 0.00 0.45   392.00
>  0.08   34.000.00   34.00   6.86   1.60
> Sdf (4TB EC)
>   0.0014.000.00   26.00 0.00 0.2217.71
>  1.00   38.550.00   38.55   0.68   1.77
> Sdc (4TB EC)
>   0.00 0.000.001.33 0.00 0.01 8.75
>  0.02   12.250.00   12.25  12.25   1.63
>
> While at this time is just complaining about slow osd.0, sometimes the other 
> cache tier ssds show some slow response, but not as frequently.
>
>
> I 

[ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
I have setup a new linux cluster to allow migration from our old SAN based 
cluster to a new cluster with ceph.
All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
I am basically running stock ceph settings, with just turning the write cache 
off via hdparm on the drives, and temporarily turning off scrubbing.

The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
Server performance should be good.  Since I am running cephfs, I have tiering 
setup.
Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So 
the idea is to ensure a single host failure.
Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
replicated set with size=2
The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
the cache SSD. It has power loss protection
My crush map is setup to ensure the cache pool uses only the 4 850 pro and the 
erasure code uses only the 16 spinning 4TB drives.

The problems that I am seeing is that I start copying data from our old san to 
the ceph volume, and once the cache tier gets to my  target_max_bytes of 1.4 
TB, I start seeing:

HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
26 ops are blocked > 65.536 sec on osd.0
37 ops are blocked > 32.768 sec on osd.0
1 osds have slow requests
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set

osd.0 is the cache ssd

If I watch iostat on the cache ssd, I see the queue lengths are high and the 
await are high
Below is the iostat on the cache drive (osd.0) on the first host. The avgqu-sz 
is between 87 and 182 and the await is between 88ms and 1193ms

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await r_await  w_await  svctm  %util
sdb               0.00     0.33    9.00   84.33     0.96    20.11   462.40    75.92   397.56  125.67   426.58  10.70  99.90
                  0.00     0.67   30.00   87.33     5.96    21.03   471.20    67.86   910.95   87.00  1193.99   8.27  97.07
                  0.00    16.67   33.00  289.33     4.21    18.80   146.20    29.83    88.99   93.91    88.43   3.10  99.83
                  0.00     7.33    7.67  261.67     1.92    19.63   163.81   117.42   331.97  182.04   336.36   3.71 100.00


If I look at the iostat for all the drives, only the cache ssd drive is backed 
up

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await r_await  w_await  svctm  %util
Sdg (journal for cache drive)
                  0.00     6.33    0.00    8.00     0.00     0.07    19.04     0.00     0.33    0.00     0.33   0.33   0.27
Sdb (cache drive)
                  0.00     0.33    3.33   82.00     0.83    20.07   501.68   106.75  1057.81  269.40  1089.86  11.72 100.00
Sda (4TB EC)
                  0.00     0.00    0.00    4.00     0.00     0.02     9.33     0.00     0.00    0.00     0.00   0.00   0.00
Sdd (4TB EC)
                  0.00     0.00    0.00    2.33     0.00     0.45   392.00     0.08    34.00    0.00    34.00   6.86   1.60
Sdf (4TB EC)
                  0.00    14.00    0.00   26.00     0.00     0.22    17.71     1.00    38.55    0.00    38.55   0.68   1.77
Sdc (4TB EC)
                  0.00     0.00    0.00    1.33     0.00     0.01     8.75     0.02    12.25    0.00    12.25  12.25   1.63

While at this time it is just complaining about slow osd.0, sometimes the other 
cache tier ssds show some slow response, but not as frequently.


I occasionally see complaints about a client not responding to cache pressure, 
and yesterday while copying several terabytes, the client doing the copy was 
noted for failing to respond to capability release, and I ended up rebooting it.

It just seems the cluster isn't handling large amounts of data copies the way 
an nfs or san based volume would, and I am worried about moving our users to a 
cluster that is already showing signs of performance issues, even when I am 
just doing a copy with no other users. I am doing only one rsync at a time.

Is the problem that I need to use a later kernel for the clients mounting the 
volume? I have read some posts about that, but the docs say centos 7 with 3.10 
is ok.
Do I need more drives in my cache pool? I only have 4 ssd drives in the cache 
pool (one on each host), with each having a separate journal drive.
But is that too much of a hot spot since all i/o has to go to the cache layer?
It seems like my ssds should be able to keep up with a single rsync copy.
Is there something set wrong on my ssds that they can't keep up?
I put the metadata pool on the ssd cache tier drives as well.

Any ideas where the problem is or what I need to change to make this stable?


Thanks. Additional details below

The ceph osd tier drives are osd 0,  5,  10, 15

ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
63155G 50960G   12195G 19.31
POOLS:
NAMEID USED   %USED MAX AVAIL OBJECTS

Re: [ceph-users] cache tier not flushing 10.2.2

2016-09-20 Thread Jim Kilborn
Please disregard this. I have an error in my target_max_bytes that was causing 
the issue. I now have it evicting the cache.
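
The fix is presumably just giving target_max_bytes a value that really is
bytes; something like this for a 1 TB cap:

ceph osd pool set cephfs-cache target_max_bytes 1099511627776   # 1 TiB, the option is in bytes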







Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Jim Kilborn<mailto:j...@kilborns.com>
Sent: Tuesday, September 20, 2016 12:59 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] cache tier not flushing 10.2.2



Simple issue I can't find with the cache tier. Thanks for taking the time…

Setup a new cluster with ssd cache tier. My cache tier is on 1TB ssd. With 2 
replicas. It just fills up my cache until the ceph filesystem stops allowing 
access.
I even set the target_max_bytes to 1048576 (1GB) and it still doesn't flush.

Here are the settings:

Setup the pools

ceph osd pool create cephfs-cache 512 512 replicated ssd_ruleset
ceph osd pool create cephfs-metadata 512 512 replicated ssd_ruleset
ceph osd pool create cephfs-data 512 512 erasure default spinning_ruleset
ceph osd pool set cephfs-cache min_size 1
ceph osd pool set cephfs-cache size 2
ceph osd pool set cephfs-metadata min_size 1
ceph osd pool set cephfs-metadata size 2



Add tiers

ceph osd tier add cephfs-data cephfs-cache
ceph osd tier cache-mode cephfs-cache writeback
ceph osd tier set-overlay cephfs-data cephfs-cache
ceph osd pool set cephfs-cache hit_set_type bloom
ceph osd pool set cephfs-cache hit_set_count 1
ceph osd pool set cephfs-cache hit_set_period 3600
ceph osd pool set cephfs-cache target_max_bytes 1048576 # 1 TB
ceph osd pool set cephfs-cache cache_target_dirty_ratio 0.4 # percentage of 
target_max_bytes before flushes dirty objects
ceph osd pool set cephfs-cache cache_target_dirty_high_ratio 0.6 # percentage 
of target_max_bytes before flushes dirty objects more aggressively
ceph osd pool set cephfs-cache cache_target_full_ratio 0.80 # percentage of 
cache full before evicts objects


Am I missing something stupid? Must be. I can cause it to flush with
rados -p cephfs-cache cache-try-flush-evict-all

Should my metadata not be on the same pool as the cache pool?

I can't figure out why it doesn't start flushing when I copy over 2 GB of data. 
It just goes to
'cephfs-cache' at/near target max

Regards,
Jim

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cache tier not flushing 10.2.2

2016-09-20 Thread Jim Kilborn
Simple issue I can't find with the cache tier. Thanks for taking the time…

Setup a new cluster with ssd cache tier. My cache tier is on 1TB ssd. With 2 
replicas. It just fills up my cache until the ceph filesystem stops allowing 
access.
I even set the target_max_bytes to 1048576 (1GB) and it still doesn't flush.

Here are the settings:

Setup the pools

ceph osd pool create cephfs-cache 512 512 replicated ssd_ruleset
ceph osd pool create cephfs-metadata 512 512 replicated ssd_ruleset
ceph osd pool create cephfs-data 512 512 erasure default spinning_ruleset
ceph osd pool set cephfs-cache min_size 1
ceph osd pool set cephfs-cache size 2
ceph osd pool set cephfs-metadata min_size 1
ceph osd pool set cephfs-metadata size 2



Add tiers

ceph osd tier add cephfs-data cephfs-cache
ceph osd tier cache-mode cephfs-cache writeback
ceph osd tier set-overlay cephfs-data cephfs-cache
ceph osd pool set cephfs-cache hit_set_type bloom
ceph osd pool set cephfs-cache hit_set_count 1
ceph osd pool set cephfs-cache hit_set_period 3600
ceph osd pool set cephfs-cache target_max_bytes 1048576 # 1 TB
ceph osd pool set cephfs-cache cache_target_dirty_ratio 0.4 # percentage of 
target_max_bytes before flushes dirty objects
ceph osd pool set cephfs-cache cache_target_dirty_high_ratio 0.6 # percentage 
of target_max_bytes before flushes dirty objects more aggressively
ceph osd pool set cephfs-cache cache_target_full_ratio 0.80 # percentage of 
cache full before evicts objects


Am I missing something stupid? Must be. I can cause it to flush with
rados -p cephfs-cache cache-try-flush-evict-all

Should my metadata not be on the same pool as the cache pool?

I can't figure out why it doesn't start flushing when I copy over 2 GB of data. 
It just goes to
'cephfs-cache' at/near target max

Regards,
Jim

Sent from Mail for Windows 10

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds damage detected - Jewel

2016-09-15 Thread Jim Kilborn
I have a replicated cache pool and metadata pool which reside on ssd drives, 
with a size of 2, backed by a erasure coded data pool
The cephfs filesystem was in a healthy state. I pulled an SSD drive, to perform 
an exercise in osd failure.

The cluster recognized the ssd failure, and replicated back to a healthy state, 
but I got a message saying the mds0 Metadata damage detected.


   cluster 62ed97d6-adf4-12e4-8fd5-3d9701b22b86
 health HEALTH_ERR
mds0: Metadata damage detected
mds0: Client master01.div18.swri.org failing to respond to cache 
pressure
 monmap e2: 3 mons at 
{ceph01=192.168.19.241:6789/0,ceph02=192.168.19.242:6789/0,ceph03=192.168.19.243:6789/0}
election epoch 24, quorum 0,1,2 
ceph01,darkjedi-ceph02,darkjedi-ceph03
  fsmap e25: 1/1/1 up {0=-ceph04=up:active}, 1 up:standby
 osdmap e1327: 20 osds: 20 up, 20 in
flags sortbitwise
  pgmap v11630: 1536 pgs, 3 pools, 100896 MB data, 442 kobjects
201 GB used, 62915 GB / 63116 GB avail
1536 active+clean

In the mds logs of the active mds, I see the following:

7fad0c4b2700  0 -- 192.168.19.244:6821/1 >> 192.168.19.243:6805/5090 
pipe(0x7fad25885400 sd=56 :33513 s=1 pgs=0 cs=0 l=1 c=0x7fad2585f980).fault
7fad14add700  0 mds.beacon.darkjedi-ceph04 handle_mds_beacon no longer laggy
7fad101d3700  0 mds.0.cache.dir(1016c08) _fetched missing object for [dir 
1016c08 /usr/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741952 f() n() 
hs=0+0,ss=0+0 | waiter=1 authpin=1 0x7fad25ced500]
7fad101d3700 -1 log_channel(cluster) log [ERR] : dir 1016c08 object missing 
on disk; some files may be lost
7fad0f9d2700  0 -- 192.168.19.244:6821/1 >> 192.168.19.242:6800/3746 
pipe(0x7fad25a4e800 sd=42 :0 s=1 pgs=0 cs=0 l=1 c=0x7fad25bd5180).fault
7fad14add700 -1 log_channel(cluster) log [ERR] : unmatched fragstat size on 
single dirfrag 1016c08, inode has f(v0 m2016-09-14 14:00:36.654244 
13=1+12), dirfrag has f(v0 m2016-09-14 14:00:36.654244 1=0+1)
7fad14add700 -1 log_channel(cluster) log [ERR] : unmatched rstat rbytes on 
single dirfrag 1016c08, inode has n(v77 rc2016-09-14 14:00:36.654244 
b1533163206 48173=43133+5040), dirfrag has n(v77 rc2016-09-14 14:00:36.654244 
1=0+1)
7fad101d3700 -1 log_channel(cluster) log [ERR] : unmatched rstat on 
1016c08, inode has n(v78 rc2016-09-14 14:00:36.656244 2=0+2), dirfrags have 
n(v0 rc2016-09-14 14:00:36.656244 3=0+3)

I'm not sure why the metadata got damaged, since it's being replicated, but I 
want to fix the issue and test again. However, I can't figure out the steps to 
repair the metadata.
I saw something about running a damage ls, but I can’t seem to find a more 
detailed repair document. Any pointers to get the metadata fixed? Seems both my 
mds daemons are running correctly, but that error bothers me. Shouldn’t happen 
I think.

I tried the following command, but it doesn’t understand it….
ceph --admin-daemon /var/run/ceph/ceph-mds. ceph03.asok damage ls


I then rebooted all 4 ceph servers simultaneously (another stress test), and 
the ceph cluster came back up healthy, and the mds damaged status has been 
cleared!! I then replaced the ssd, put it back into service, and let the 
backfill complete. The cluster was fully healthy. I pulled another ssd and 
repeated this process, yet I never got the damaged mds messages. Was this just 
random metadata damage due to yanking a drive out? Are there any lingering 
effects of the metadata damage that I need to address?


-  Jim

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacing a failed OSD

2016-09-14 Thread Jim Kilborn
Reed,



Thanks for the response.



Your process is the one that I ran. However, I have a crushmap with ssd and 
sata drives in different buckets (hosts made up of host types, with an ssd and 
a spinning hosttype for each host) because I am using ssd drives for a 
replicated cache in front of erasure coded data for cephfs.



I have “osd crush update on start = false” so that osds don’t randomly get 
added to the crush map, because it wouldn’t know where to put that osd.



I am using puppet to provision the drives when it sees one in a slot and it 
doesn’t see the ceph signature (I guess). I am using the ceph puppet module.



The real confusion is why I have to remove it from the crush map. Once I remove 
it from the crush map, it does bring it up as the same osd number, but it's not 
in the crush map, so I have to put it back where it belongs. Just seems strange 
that it must be removed from the crush map.



Basically, I export the crush map, remove the osd from the crush map, then 
redeploy the drive. Then when it gets up and running as the same osd number, I 
import the exported crush map to get it back in the cluster.
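
For reference, the export/edit/import cycle described above maps to roughly
this (file names are arbitrary):

ceph osd getcrushmap -o crush.bin        # grab the current map before removing the osd
crushtool -d crush.bin -o crush.txt      # decompile so the osd entry can be edited/removed
crushtool -c crush.txt -o crush.new      # recompile after editing
ceph osd setcrushmap -i crush.new        # import once the replacement osd is back up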



I guess that is just how it has to be done.



Thanks again



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Reed Dier<mailto:reed.d...@focusvq.com>
Sent: Wednesday, September 14, 2016 1:39 PM
To: Jim Kilborn<mailto:j...@kilborns.com>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Replacing a failed OSD



Hi Jim,

This is pretty fresh in my mind so hopefully I can help you out here.

Firstly, the crush map will back fill any holes in the enumeration that are 
existing. So assuming only one drive has been removed from the crush map, it 
will repopulate the same OSD number.

My steps for removing an OSD are run from the host node:

> ceph osd down osd.i
> ceph osd out osd.i
> stop ceph-osd id=i
> umount /var/lib/ceph/osd/ceph-i
> ceph osd crush remove osd.i
> ceph auth del osd.i
> ceph osd rm osd.i


From here, the disk is removed from the ceph cluster and crush map, and is ready 
for removal and replacement.

From there I deploy the new osd with ceph-deploy from my admin node using:

> ceph-deploy disk list nodei
> ceph-deploy disk zap nodei:sdX
> ceph-deploy --overwrite-conf osd prepare nodei:sdX


This will prepare the disk and insert it back into the crush map, bringing it 
back up and in. The OSD number should remain the same, as it will fill the gap 
left from the previous OSD removal.

Hopefully this helps,

Reed

> On Sep 14, 2016, at 11:00 AM, Jim Kilborn <j...@kilborns.com> wrote:
>
> I am finishing testing our new cephfs cluster and wanted to document a failed 
> osd procedure.
> I noticed that when I pulled a drive, to simulate a failure, and run through 
> the replacement steps, the osd has to be removed from the crushmap in order 
> to initialize the new drive as the same osd number.
>
> Is this correct that I have to remove it from the crushmap, then after the 
> osd is initialized, and mounted, add it back to the crush map? Is there no 
> way to have it reuse the same osd # without removing if from the crush map?
>
> Thanks for taking the time….
>
>
> -  Jim
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replacing a failed OSD

2016-09-14 Thread Jim Kilborn
I am finishing testing our new cephfs cluster and wanted to document a failed 
osd procedure.
I noticed that when I pulled a drive, to simulate a failure, and ran through 
the replacement steps, the osd has to be removed from the crushmap in order to 
initialize the new drive as the same osd number.

Is this correct that I have to remove it from the crushmap, then after the osd 
is initialized, and mounted, add it back to the crush map? Is there no way to 
have it reuse the same osd # without removing it from the crush map?

Thanks for taking the time….


-  Jim

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FW: Multiple public networks and ceph-mon daemons listening

2016-09-08 Thread Jim Kilborn
Thanks for the clarification Greg. The private network was a NAT network, but I 
got rid of the NAT and set the head node to straight routing. I went ahead and 
set all the daemons to the private network, and it's working fine now. I was 
hoping to avoid routing the outside traffic, but no big deal.

I’m new to cephfs and ceph completely, so I’m in that steep learning curve 
phase

Thanks again

Sent from Windows Mail

From: Gregory Farnum<mailto:gfar...@redhat.com>
Sent: ‎Thursday‎, ‎September‎ ‎8‎, ‎2016 ‎6‎:‎05‎ ‎PM
To: Jim Kilborn<mailto:j...@kilborns.com>
Cc: Wido den Hollander<mailto:w...@42on.com>, 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>

On Thu, Sep 8, 2016 at 7:13 AM, Jim Kilborn <j...@alamois.com> wrote:
> Thanks for the reply.
>
>
>
> When I said the compute nodes mounted the cephfs volume, I am referring to a 
> real linux cluster of physical machines,. Openstack VM/ compute nodes are not 
> involved in my setup. We are transitioning from an older linux cluster using 
> nfs from the head node/san to the new cluster using cephfs. All physical 
> systems mounting the shared volume. Storing home directories and data.
>
>
>
> http://oi63.tinypic.com/2ljp72v.jpg
>
>
>
>
>
> The linux cluster is in a NAT private network, where the only systems 
> attached to the corporate network are the ceph servers and our main linux 
> head node. They are dual connected.
>
> Your saying I cant have ceph volumes mounted and the traffic to the osds 
> coming in on more than one interface? It is limited to one interface?

Well, obviously clients connect to OSDs on the "public" network,
right? The "cluster" network is used by the OSDs for replication. And
as you've noticed, the monitors only use one address, and that needs
to be accessible/routable for everybody.

I presume you *have* a regular IP network on the OSDs that the clients
can route? Otherwise they won't be able to access any data at all. So
I think you just want to set up the monitors and the OSDs on the same
TCP network...

Otherwise there's a bit of a misunderstanding, probably because of the
names. Consider "cluster" network to mean "OSD replication traffic"
and "public" to mean "everything else, including all client IO".
-Greg
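
For reference, the two settings Greg is describing live in ceph.conf; the
subnets here are illustrative only:

[global]
public network  = 10.10.0.0/24       # mons bind here; clients and OSD front-side traffic
cluster network = 192.168.20.0/24    # OSD replication/backfill only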
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FW: Multiple public networks and ceph-mon daemons listening

2016-09-08 Thread Jim Kilborn
Hello all…

I am setting up a ceph cluster (jewel) on a private network. The compute nodes 
are all running centos 7 and mounting the cephfs volume using the kernel 
driver. The ceph storage nodes are dual connected to the private network, as 
well as our corporate network, as some users need to mount the volume to their 
workstations (also centos 7) from the corporate network.

The private network is infiniband, so I have that set as the cluster network , 
and have both networks listed in the private networks in the ceph.conf.

However, the mon daemons only listen on the private network, and if I want to 
mount the volume from the corporate network, it has to mount via the private 
network address of the ceph storage nodes, which means that the cluster head 
node (linux) has to route that traffic.

I would like to know if there is a way to have the monitors listen on both 
their interfaces, like the osd/mds daemons do, so I could use the appropriate 
address in the fstab of the clients, depending on which network they are on.

Alternatively, I could have one of the mon daemons added with its private 
network address, as all ceph storage nodes are dual connected, but I would lose 
some fault tolerance I think (if that monitor goes down)

Just thought there must be a better way. I have 3 monitor nodes (dual 
functioning as osd nodes). They are all brand new dell 730xd with 128GB ram and 
dual xeons. I also have an ssd cache in front of an erasure coded pool.

Any suggestions?

Thanks for taking the time…

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com