Re: [ceph-users] Howto reduce the impact from cephx with small IO

2016-04-20 Thread Udo Lembke

Hi Mark,
thanks for the links.

If I search for wip-auth I find nothing on docs.ceph.com... does this mean
that wip-auth never found its way into the Ceph code base?!


But I'm wondering about the RHEL7 results at the link
http://www.spinics.net/lists/ceph-devel/msg22416.html

Unfortunately there are no values for RHEL7 with auth...
But is it known on which side (or to what percentage) the bottleneck for
cephx lies (client, mon, osd)? My clients (qemu on proxmox-ve) cannot be
changed, but my OSDs could also run on RHEL7/CentOS if that brings a
performance boost. The mons are currently running on the proxmox-ve hosts.
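
A middle ground that might be worth testing (only a sketch - whether it is
acceptable security-wise depends on the environment): keep cephx
authentication enabled but switch off the per-message signing, which is
suspected to be part of the small-IO overhead:

[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  cephx sign messages = false
  cephx require signatures = false

All daemons and clients have to be restarted before this takes effect.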


Udo


On 20.04.2016 at 19:13, Mark Nelson wrote:

Hi Udo,

There was quite a bit of discussion and some partial improvements to 
cephx performance about a year ago.  You can see some of the 
discussion here:


http://www.spinics.net/lists/ceph-devel/msg3.html

and in particular these tests:

http://www.spinics.net/lists/ceph-devel/msg22416.html

Mark

On 04/20/2016 11:50 AM, Udo Lembke wrote:

Hi,
on a small test system (3 nodes (mon + osd), 6 OSDs, ceph 0.94.6) I
compare performance with and without cephx.

I use fio for that inside a VM on a host outside the 3 ceph nodes,
with this command:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4k --size=4G
--direct=1 --name=fiojob_4k
All tests are run three times (after clearing caches) and I take the
average (the values are very close together).

Whether cephx is enabled or not doesn't matter for a big blocksize of 4M - but it does for 4k!

If I disable cephx I get:
7040 kB/s bandwidth
1759 IOPS
564 µs clat

The same config, but with cephx, gives these values:
4265 kB/s bandwidth
1066 IOPS
933 µs clat

This shows that performance drops by 40% with cephx!!

Disabling cephx is not an alternative, because any system that has access
to the ceph network can read/write all data...

ceph.conf without cephx:
[global]
  auth_cluster_required = none
  auth_service_required = none
  auth_client_required = none
  cephx_sign_messages = false
  cephx_require_signatures = false
  #
  cluster network =...

ceph.conf with cephx:
[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  #
  cluster network =...

Is it possible to reduce the cephx impact?
Any hints are welcome.


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Remove incomplete PG

2016-04-20 Thread Tyler Wilson
Hello All,

Are there any documented steps to remove a placement group that is stuck
inactive? I had a situation where two nodes went offline, and I tried
rescuing with https://ceph.com/community/incomplete-pgs-oh-my/; however, the
PG remained inactive after importing and starting. Now I am just trying to
get the cluster able to read/write again with the remaining placement
groups.
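
For reference, the commands that usually come up for this situation are the
ones below (a sketch only, not a recommendation - all of them accept losing
whatever data was in the affected PG):

ceph health detail | grep -E 'incomplete|inactive'
ceph pg <pgid> query                          # look at the peering blockers
ceph osd lost <osd-id> --yes-i-really-mean-it
ceph pg force_create_pg <pgid>                # last resort: recreate the PG empty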



Thanks for any and all assistance!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD image mounted by command "rbd-nbd" the status is read-only.

2016-04-20 Thread Mika c
Hi cephers,
 Read this post "CEPH Jewel Preiew
"
before.
Follow the steps can map and mount rbd image to /dev/nbd successfully.
But I can not write any files. The error message is "Read-only file
system".
​Is this feature still in the experimental stage?


Best wishes,
Mika
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds segfault on cephfs snapshot creation

2016-04-20 Thread Yan, Zheng
On Wed, Apr 20, 2016 at 11:52 PM, Brady Deetz  wrote:
>
>
> On Wed, Apr 20, 2016 at 4:09 AM, Yan, Zheng  wrote:
>>
>> On Wed, Apr 20, 2016 at 12:12 PM, Brady Deetz  wrote:
>> > As soon as I create a snapshot on the root of my test cephfs deployment
>> > with
>> > a single file within the root, my mds server kernel panics. I understand
>> > that snapshots are not recommended. Is it beneficial to developers for
>> > me to
>> > leave my cluster in its present state and provide whatever debugging
>> > information they'd like? I'm not really looking for a solution to a
>> > mission
>> > critical issue as much as providing an opportunity for developers to
>> > pull
>> > stack traces, logs, etc from a system affected by some sort of bug in
>> > cephfs/mds. This happens every time I create a directory inside my .snap
>> > directory.
>>
>> It's likely your kernel is too old for the kernel mount. Which version of
>> the kernel do you use?
>
>
> All nodes in the cluster share the versions listed below. This actually
> appears to be a cephfs client (native) issue (see stacktrace and kernel dump
> below). I have my fs mounted on my mds which is why I thought it was the mds
> causing a panic.
>
> Linux mon0 3.13.0-77-generic #121-Ubuntu SMP Wed Jan 20 10:50:42 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux

Please use a 4.x kernel. Besides, ceph-mds 0.80 is too old for using
snapshots; creating snapshots can cause various issues.
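
If upgrading the kernel right away is not possible, the FUSE client avoids
the kernel-client code path entirely - a sketch, using the mon address from
the output below (note that this does not make snapshots on ceph-mds 0.80 any
safer):

ceph-fuse -m 192.168.1.120:6789 /mnt/cephfs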

Regards
Yan, Zheng

>
> ceph-admin@mon0:~$ cat /etc/issue
> Ubuntu 14.04.4 LTS \n \l
>
> ceph-admin@mon0:~$ dpkg -l | grep ceph | tr -s ' ' | cut -d ' ' -f 2,3
> ceph 0.80.11-0ubuntu1.14.04.1
> ceph-common 0.80.11-0ubuntu1.14.04.1
> ceph-deploy 1.4.0-0ubuntu1
> ceph-fs-common 0.80.11-0ubuntu1.14.04.1
> ceph-mds 0.80.11-0ubuntu1.14.04.1
> libcephfs1 0.80.11-0ubuntu1.14.04.1
> python-ceph 0.80.11-0ubuntu1.14.04.1
>
>
> ceph-admin@mon0:~$ ceph status
> cluster 186408c3-df8a-4e46-a397-a788fc380039
>  health HEALTH_OK
>  monmap e1: 1 mons at {mon0=192.168.1.120:6789/0}, election epoch 1,
> quorum 0 mon0
>  mdsmap e48: 1/1/1 up {0=mon0=up:active}
>  osdmap e206: 15 osds: 15 up, 15 in
>   pgmap v25298: 704 pgs, 5 pools, 123 MB data, 53 objects
> 1648 MB used, 13964 GB / 13965 GB avail
>  704 active+clean
>
>
> ceph-admin@mon0:~$ ceph osd tree
> # id    weight  type name       up/down reweight
> -1      13.65   root default
> -2      2.73            host osd0
> 0       0.91                    osd.0   up      1
> 1       0.91                    osd.1   up      1
> 2       0.91                    osd.2   up      1
> -3      2.73            host osd1
> 3       0.91                    osd.3   up      1
> 4       0.91                    osd.4   up      1
> 5       0.91                    osd.5   up      1
> -4      2.73            host osd2
> 6       0.91                    osd.6   up      1
> 7       0.91                    osd.7   up      1
> 8       0.91                    osd.8   up      1
> -5      2.73            host osd3
> 9       0.91                    osd.9   up      1
> 10      0.91                    osd.10  up      1
> 11      0.91                    osd.11  up      1
> -6      2.73            host osd4
> 12      0.91                    osd.12  up      1
> 13      0.91                    osd.13  up      1
> 14      0.91                    osd.14  up      1
>
>
> http://tech-hell.com/dump.201604201536
>
> [ 5869.157340] [ cut here ]
> [ 5869.157527] kernel BUG at
> /build/linux-faWYrf/linux-3.13.0/fs/ceph/inode.c:928!
> [ 5869.157797] invalid opcode:  [#1] SMP
> [ 5869.157977] Modules linked in: kvm_intel kvm serio_raw ceph libceph
> libcrc32c fscache psmouse floppy
> [ 5869.158415] CPU: 0 PID: 46 Comm: kworker/0:1 Not tainted
> 3.13.0-77-generic #121-Ubuntu
> [ 5869.158709] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> [ 5869.158925] Workqueue: ceph-msgr con_work [libceph]
> [ 5869.159124] task: 8809abf3c800 ti: 8809abf46000 task.ti:
> 8809abf46000
> [ 5869.159422] RIP: 0010:[]  []
> splice_dentry+0xd5/0x190 [ceph]
> [ 5869.159768] RSP: 0018:8809abf47b68  EFLAGS: 00010282
> [ 5869.159963] RAX: 0004 RBX: 8809a08b2780 RCX:
> 0001
> [ 5869.160224] RDX:  RSI: 8809a04f8370 RDI:
> 8809a08b2780
> [ 5869.160484] RBP: 8809abf47ba8 R08: 8809a982c400 R09:
> 8809a99ef6e8
> [ 5869.160550] R10: 000819d8 R11:  R12:
> 8809a04f8370
> [ 5869.160550] R13: 8809a08b2780 R14: 8809aad5fc00 R15:
> 
> [ 5869.160550] FS:  () GS:8809e3c0()
> knlGS:
> [ 5869.160550] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 5869.160550] CR2: 7f60f37ff5c0 CR3: 0009a5f63000 CR4:
> 06f0
> [ 5869.160550] Stack:
> [ 5869.160550]  8809a5da1000 8809aad5fc00 8809a99ef408
> 8809a99ef400
> [ 5869.160550]  8809a04f8370 8809a08b2780 8809aad5fc00
>

Re: [ceph-users] cephfs does not seem to properly free up space

2016-04-20 Thread Yan, Zheng
To delete these orphan objects:

List all objects in the cephfs data pool. Object names are in the form [inode
number in hex].[offset in hex]. If an object has 'offset > 0' but there is
no object with 'offset == 0' and the same inode number, it is an orphan
object.

It's not difficult to write a script to find all orphan objects and
delete them (see the sketch below). If there are multiple data pools, repeat
the above steps for each data pool.
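
A minimal sketch of such a script (read-only: it only prints candidates;
'cephfs_data' is a placeholder pool name, and the per-object grep is slow on
large pools but keeps the idea obvious):

rados -p cephfs_data ls > objects.txt
# inodes that still have a head object at offset 0
grep '\.00000000$' objects.txt | cut -d. -f1 | sort -u > live_inodes.txt
# any object whose inode is not in that list is an orphan candidate
while read obj; do
    ino=${obj%%.*}
    grep -qx "$ino" live_inodes.txt || echo "orphan: $obj"
done < objects.txt
# to actually delete a confirmed orphan:  rados -p cephfs_data rm <object>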

Regards
Yan, Zheng


On Wed, Apr 20, 2016 at 4:20 PM, Simion Rad  wrote:
> Yes, we do use customized layout settings for most of our folders.
> We have some long running backup jobs which require high-throughput writes in 
> order to finish in a reasonable amount of time.
> 
> From: Florent B 
> Sent: Wednesday, April 20, 2016 11:07
> To: Yan, Zheng; Simion Rad
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] cephfs does not seem to properly free up space
>
> That seems to be the bug we have for years now with CephFS. We always
> used customized layout.
>
> On 04/20/2016 02:20 AM, Yan, Zheng wrote:
>> have you ever used fancy layout?
>>
>> see http://tracker.ceph.com/issues/15050
>>
>>
>> On Wed, Apr 20, 2016 at 3:17 AM, Simion Rad  wrote:
>>> Mounting and unmounting doesn't change anything.
>>> The used space reported by the df command is nearly the same as the values
>>> returned by the ceph -s command.
>>>
>>> Example 1, df output:
>>> ceph-fuse   334T  134T  200T  41% /cephfs
>>>
>>> Example 2, ceph -s output:
>>>  health HEALTH_WARN
>>> mds0: Many clients (22) failing to respond to cache pressure
>>> noscrub,nodeep-scrub,sortbitwise flag(s) set
>>>  monmap e1: 5 mons at 
>>> {r730-12=10.103.213.12:6789/0,r730-4=10.103.213.4:6789/0,r730-5=
>>> 10.103.213.5:6789/0,r730-8=10.103.213.8:6789/0,r730-9=10.103.213.9:6789/0}
>>> election epoch 132, quorum 0,1,2,3,4 
>>> r730-4,r730-5,r730-8,r730-9,r730-12
>>>  mdsmap e14637: 1/1/1 up {0=ceph2-mds-2=up:active}
>>>  osdmap e6549: 68 osds: 68 up, 68 in
>>> flags noscrub,nodeep-scrub,sortbitwise
>>>   pgmap v4394151: 896 pgs, 3 pools, 54569 GB data, 56582 kobjects
>>> 133 TB used, 199 TB / 333 TB avail
>>>  896 active+clean
>>>   client io 47395 B/s rd, 1979 kB/s wr, 388 op/s
>>>
>>>
>>> 
>>> From: John Spray 
>>> Sent: Tuesday, April 19, 2016 22:04
>>> To: Simion Rad
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] cephfs does not seem to properly free up space
>>>
>>> On Tue, Apr 19, 2016 at 2:40 PM, Simion Rad  wrote:
 Hello,


 At my workplace we have a production cephfs cluster (334 TB on 60 OSDs)
 which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on
 Ubuntu 14.04.3 (linux 3.19.0-33).

 It seems that cephfs still doesn't free up space at all or at least that's
 what df command tells us.
>>> Hmm, historically there were bugs with the purging code, but I thought
>>> we fixed them before Infernalis.
>>>
>>> Does the space get freed after you unmount the client?  Some issues
>>> have involved clients holding onto references to unlinked inodes.
>>>
>>> John
>>>
 Is there a better way of getting a df-like output with other command for
 cephfs  ?


 Thank you,

 Marius Rad

 SysAdmin

 www.propertyshark.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Howto reduce the impact from cephx with small IO

2016-04-20 Thread Mark Nelson

Hi Udo,

There was quite a bit of discussion and some partial improvements to 
cephx performance about a year ago.  You can see some of the discussion 
here:


http://www.spinics.net/lists/ceph-devel/msg3.html

and in particular these tests:

http://www.spinics.net/lists/ceph-devel/msg22416.html

Mark

On 04/20/2016 11:50 AM, Udo Lembke wrote:

Hi,
on a small test system (3 nodes (mon + osd), 6 OSDs, ceph 0.94.6) I
compare performance with and without cephx.

I use fio for that inside a VM on a host outside the 3 ceph nodes,
with this command:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4k --size=4G
--direct=1 --name=fiojob_4k
All tests are run three times (after clearing caches) and I take the
average (the values are very close together).

Whether cephx is enabled or not doesn't matter for a big blocksize of 4M - but it does for 4k!

If I disable cephx I get:
7040 kB/s bandwidth
1759 IOPS
564 µs clat

The same config, but with cephx, gives these values:
4265 kB/s bandwidth
1066 IOPS
933 µs clat

This shows that performance drops by 40% with cephx!!

Disabling cephx is not an alternative, because any system that has access
to the ceph network can read/write all data...

ceph.conf without cephx:
[global]
  auth_cluster_required = none
  auth_service_required = none
  auth_client_required = none
  cephx_sign_messages = false
  cephx_require_signatures = false
  #
  cluster network =...

ceph.conf with cephx:
[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  #
  cluster network =...

Is it possible to reduce the cephx impact?
Any hints are welcome.


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Howto reduce the impact from cephx with small IO

2016-04-20 Thread Udo Lembke

Hi,
on a small test system (3 nodes (mon + osd), 6 OSDs, ceph 0.94.6) I
compare performance with and without cephx.


I use fio for that inside a VM on a host outside the 3 ceph nodes,
with this command:
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4k --size=4G
--direct=1 --name=fiojob_4k
All tests are run three times (after clearing caches) and I take the
average (the values are very close together).
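
For reference, the caches are cleared before each run with something like
this on the client VM and the OSD nodes (an assumption about the exact method
used here):

sync && echo 3 > /proc/sys/vm/drop_caches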


Whether cephx is enabled or not doesn't matter for a big blocksize of 4M - but it does for 4k!

If I disable cephx I get:
7040 kB/s bandwidth
1759 IOPS
564 µs clat

The same config, but with cephx, gives these values:
4265 kB/s bandwidth
1066 IOPS
933 µs clat

This shows that performance drops by 40% with cephx!!

Disabling cephx is not an alternative, because any system that has access
to the ceph network can read/write all data...


ceph.conf without cephx:
[global]
 auth_cluster_required = none
 auth_service_required = none
 auth_client_required = none
 cephx_sign_messages = false
 cephx_require_signatures = false
 #
 cluster network =...

ceph.conf with cephx:
[global]
 auth client required = cephx
 auth cluster required = cephx
 auth service required = cephx
 #
 cluster network =...

Is it possible to reduce the cephx impact?
Any hints are welcome.


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds segfault on cephfs snapshot creation

2016-04-20 Thread Brady Deetz
On Wed, Apr 20, 2016 at 4:09 AM, Yan, Zheng  wrote:

> On Wed, Apr 20, 2016 at 12:12 PM, Brady Deetz  wrote:
> > As soon as I create a snapshot on the root of my test cephfs deployment
> with
> > a single file within the root, my mds server kernel panics. I understand
> > that snapshots are not recommended. Is it beneficial to developers for
> me to
> > leave my cluster in its present state and provide whatever debugging
> > information they'd like? I'm not really looking for a solution to a
> mission
> > critical issue as much as providing an opportunity for developers to pull
> > stack traces, logs, etc from a system affected by some sort of bug in
> > cephfs/mds. This happens every time I create a directory inside my .snap
> > directory.
>
> It's likely your kernel is too old for the kernel mount. Which version of
> the kernel do you use?
>

All nodes in the cluster share the versions listed below. This actually
appears to be a cephfs client (native) issue (see stacktrace and kernel
dump below). I have my fs mounted on my mds which is why I thought it was
the mds causing a panic.

Linux mon0 3.13.0-77-generic #121-Ubuntu SMP Wed Jan 20 10:50:42 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux

ceph-admin@mon0:~$ cat /etc/issue
Ubuntu 14.04.4 LTS \n \l

ceph-admin@mon0:~$ dpkg -l | grep ceph | tr -s ' ' | cut -d ' ' -f 2,3
ceph 0.80.11-0ubuntu1.14.04.1
ceph-common 0.80.11-0ubuntu1.14.04.1
ceph-deploy 1.4.0-0ubuntu1
ceph-fs-common 0.80.11-0ubuntu1.14.04.1
ceph-mds 0.80.11-0ubuntu1.14.04.1
libcephfs1 0.80.11-0ubuntu1.14.04.1
python-ceph 0.80.11-0ubuntu1.14.04.1


ceph-admin@mon0:~$ ceph status
cluster 186408c3-df8a-4e46-a397-a788fc380039
 health HEALTH_OK
 monmap e1: 1 mons at {mon0=192.168.1.120:6789/0}, election epoch 1,
quorum 0 mon0
 mdsmap e48: 1/1/1 up {0=mon0=up:active}
 osdmap e206: 15 osds: 15 up, 15 in
  pgmap v25298: 704 pgs, 5 pools, 123 MB data, 53 objects
1648 MB used, 13964 GB / 13965 GB avail
 704 active+clean


ceph-admin@mon0:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      13.65   root default
-2      2.73            host osd0
0       0.91                    osd.0   up      1
1       0.91                    osd.1   up      1
2       0.91                    osd.2   up      1
-3      2.73            host osd1
3       0.91                    osd.3   up      1
4       0.91                    osd.4   up      1
5       0.91                    osd.5   up      1
-4      2.73            host osd2
6       0.91                    osd.6   up      1
7       0.91                    osd.7   up      1
8       0.91                    osd.8   up      1
-5      2.73            host osd3
9       0.91                    osd.9   up      1
10      0.91                    osd.10  up      1
11      0.91                    osd.11  up      1
-6      2.73            host osd4
12      0.91                    osd.12  up      1
13      0.91                    osd.13  up      1
14      0.91                    osd.14  up      1


http://tech-hell.com/dump.201604201536

[ 5869.157340] [ cut here ]
[ 5869.157527] kernel BUG at
/build/linux-faWYrf/linux-3.13.0/fs/ceph/inode.c:928!
[ 5869.157797] invalid opcode:  [#1] SMP
[ 5869.157977] Modules linked in: kvm_intel kvm serio_raw ceph libceph
libcrc32c fscache psmouse floppy
[ 5869.158415] CPU: 0 PID: 46 Comm: kworker/0:1 Not tainted
3.13.0-77-generic #121-Ubuntu
[ 5869.158709] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 5869.158925] Workqueue: ceph-msgr con_work [libceph]
[ 5869.159124] task: 8809abf3c800 ti: 8809abf46000 task.ti:
8809abf46000
[ 5869.159422] RIP: 0010:[]  []
splice_dentry+0xd5/0x190 [ceph]
[ 5869.159768] RSP: 0018:8809abf47b68  EFLAGS: 00010282
[ 5869.159963] RAX: 0004 RBX: 8809a08b2780 RCX:
0001
[ 5869.160224] RDX:  RSI: 8809a04f8370 RDI:
8809a08b2780
[ 5869.160484] RBP: 8809abf47ba8 R08: 8809a982c400 R09:
8809a99ef6e8
[ 5869.160550] R10: 000819d8 R11:  R12:
8809a04f8370
[ 5869.160550] R13: 8809a08b2780 R14: 8809aad5fc00 R15:

[ 5869.160550] FS:  () GS:8809e3c0()
knlGS:
[ 5869.160550] CS:  0010 DS:  ES:  CR0: 8005003b
[ 5869.160550] CR2: 7f60f37ff5c0 CR3: 0009a5f63000 CR4:
06f0
[ 5869.160550] Stack:
[ 5869.160550]  8809a5da1000 8809aad5fc00 8809a99ef408
8809a99ef400
[ 5869.160550]  8809a04f8370 8809a08b2780 8809aad5fc00

[ 5869.160550]  8809abf47c08 a00a0dc7 8809a982c544
8809ab3f5400
[ 5869.160550] Call Trace:
[ 5869.160550]  [] ceph_fill_trace+0x2a7/0x770 [ceph]
[ 5869.160550]  [] handle_reply+0x3d5/0xc70 [ceph]
[ 5869.160550]  [] dispatch+0xe7/0xa90 [ceph]
[ 5869.160550]  [] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[ 5869.160550]  [] try_read+0x4ab/0x10d0 [libceph]
[ 5869.160550]  [] ? k

Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-20 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Udo Lembke
> Sent: 20 April 2016 07:21
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5
> 
> Hi Mike,
> I don't have experiences with RBD mounts, but see the same effect with
> RBD.
> 
> You can do some tuning to get better results (disable debug and so on).
> 
> As hint some values from a ceph.conf:
> [osd]
>  debug asok = 0/0
>  debug auth = 0/0
>  debug buffer = 0/0
>  debug client = 0/0
>  debug context = 0/0
>  debug crush = 0/0
>  debug filer = 0/0
>  debug filestore = 0/0
>  debug finisher = 0/0
>  debug heartbeatmap = 0/0
>  debug journal = 0/0
>  debug journaler = 0/0
>  debug lockdep = 0/0
>  debug mds = 0/0
>  debug mds balancer = 0/0
>  debug mds locker = 0/0
>  debug mds log = 0/0
>  debug mds log expire = 0/0
>  debug mds migrator = 0/0
>  debug mon = 0/0
>  debug monc = 0/0
>  debug ms = 0/0
>  debug objclass = 0/0
>  debug objectcacher = 0/0
>  debug objecter = 0/0
>  debug optracker = 0/0
>  debug osd = 0/0
>  debug paxos = 0/0
>  debug perfcounter = 0/0
>  debug rados = 0/0
>  debug rbd = 0/0
>  debug rgw = 0/0
>  debug throttle = 0/0
>  debug timer = 0/0
>  debug tp = 0/0
>  filestore_op_threads = 4
>  osd max backfills = 1
>  osd mount options xfs =
> "rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
>  osd mkfs options xfs = "-f -i size=2048"
>  osd recovery max active = 1
>  osd_disk_thread_ioprio_class = idle
>  osd_disk_thread_ioprio_priority = 7
>  osd_disk_threads = 1
>  osd_enable_op_tracker = false
>  osd_op_num_shards = 10
>  osd_op_num_threads_per_shard = 1
>  osd_op_threads = 4
> 
> Udo
> 
> On 19.04.2016 11:21, Mike Miller wrote:
> > Hi,
> >
> > RBD mount
> > ceph v0.94.5
> > 6 OSD with 9 HDD each
> > 10 GBit/s public and private networks
> > 3 MON nodes 1Gbit/s network
> >
> > An rbd mounted with a btrfs filesystem performs really badly when
> > reading. I tried readahead in all combinations but that does not help in
> > any way.
> >
> > Write rates are very good in excess of 600 MB/s up to 1200 MB/s,
> > average 800 MB/s Read rates on the same mounted rbd are about 10-30
> > MB/s !?

What kernel are you running? Older kernels had an issue where readahead was
capped at 2MB. In order to get good read speeds you need readahead set to
about 32MB+.
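
For example, on the client (a sketch; rbd0 is a placeholder for whichever
device the image is mapped to, and either form sets roughly 32MB):

echo 32768 > /sys/block/rbd0/queue/read_ahead_kb     # value in KB
blockdev --setra 65536 /dev/rbd0                     # value in 512-byte sectors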


> >
> > Of course, both writes and reads are from a single client machine with
> > a single write/read command. So I am looking at single threaded
> > performance.
> > Actually, I was hoping to see at least 200-300 MB/s when reading, but
> > I am seeing 10% of that at best.
> >
> > Thanks for your help.
> >
> > Mike
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitor not starting: Corruption: 12 missing files

2016-04-20 Thread Daniel.Balsiger
Dear Ceph Users,

I have the following situation in my small 3-node cluster:

--snip
root@ceph2:~# ceph status
cluster d1af2097-8535-42f2-ba8c-0667f90cab61
 health HEALTH_WARN
1 mons down, quorum 0,1 ceph0,ceph1
 monmap e1: 3 mons at 
{ceph0=10.0.0.30:6789/0,ceph1=10.0.0.31:6789/0,ceph2=10.0.0.32:6789/0}
election epoch 538, quorum 0,1 ceph0,ceph1
 osdmap e2574: 9 osds: 9 up, 9 in
  pgmap v4397840: 768 pgs, 3 pools, 112 GB data, 28339 objects
347 GB used, 8018 GB / 8365 GB avail
 768 active+clean
  client io 0 B/s rd, 3283 B/s wr, 2 op/s
--snip

It seems one monitor couldn't start, so I tried to start it on the console:

--snip
root@ceph2:~#  /usr/bin/ceph-mon --cluster=ceph -i ceph2 -f
Corruption: 12 missing files; e.g.: 
/var/lib/ceph/mon/ceph-ceph2/store.db/811920.ldb
2016-04-20 13:16:49.019857 7f39a9cbe800 -1 error opening mon data directory at 
'/var/lib/ceph/mon/ceph-ceph2': (22) Invalid argument
--snip

It seems there are indeed some files missing in the monitor's data directory:

--snip
root@ceph2:~# find /var/lib/ceph/mon/ceph-ceph2/store.db
/var/lib/ceph/mon/ceph-ceph2/store.db
/var/lib/ceph/mon/ceph-ceph2/store.db/LOG
/var/lib/ceph/mon/ceph-ceph2/store.db/811944.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811943.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811945.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811939.log
/var/lib/ceph/mon/ceph-ceph2/store.db/811936.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811952.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811947.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811946.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/MANIFEST-753605
/var/lib/ceph/mon/ceph-ceph2/store.db/811935.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811942.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811951.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811949.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/811948.ldb
/var/lib/ceph/mon/ceph-ceph2/store.db/CURRENT
/var/lib/ceph/mon/ceph-ceph2/store.db/LOCK
/var/lib/ceph/mon/ceph-ceph2/store.db/LOG.old
/var/lib/ceph/mon/ceph-ceph2/store.db/811950.ldb
--snip

What should I do to get that monitor up and working again?
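
For reference, the usual recovery path for a corrupted mon store while the
other two mons still form a quorum is to re-create this mon and let it
re-sync - a sketch using the names/paths from above (the exact way to stop
and start the daemon depends on the init system):

# on ceph2: stop the mon, move the corrupt store out of the way
mv /var/lib/ceph/mon/ceph-ceph2 /var/lib/ceph/mon/ceph-ceph2.corrupt
# on a node with a working mon: drop ceph2 from the monmap
ceph mon remove ceph2
# rebuild the store and re-add the mon
ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/mon.keyring
ceph-mon -i ceph2 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
ceph-mon -i ceph2 --public-addr 10.0.0.32:6789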

Thanks in advance, any help is appreciated.

Kind Regards, Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] EC Jerasure plugin and StreamScale Inc

2016-04-20 Thread Chandan Kumar Singh
Hi

What does the ceph community think of StreamScale's claims on
Jerasure? Is it possible to use the EC plugin for commercial purposes?
What is your advice?

Regards
Chandan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cache tier clean rate too low

2016-04-20 Thread Nick Fisk
I would advise you to take a look at osd_agent_max_ops (and presumably its
low-speed counterpart, osd_agent_max_low_ops); these should in theory dictate
how many parallel threads will be used for flushing. Do a config dump from the
admin socket to see what you are currently running with and then bump them up
to see if it helps.
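
For example (a sketch; osd.0 and the values are only placeholders):

ceph daemon osd.0 config show | grep osd_agent
ceph tell osd.* injectargs '--osd_agent_max_ops 8 --osd_agent_max_low_ops 4'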

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Josef Johansson
> Sent: 20 April 2016 06:57
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph cache tier clean rate too low
> 
> Hi,
> response in line
> On 20 Apr 2016 7:45 a.m., "Christian Balzer"  wrote:
> >
> >
> > Hello,
> >
> > On Wed, 20 Apr 2016 03:42:00 + Stephen Lord wrote:
> >
> > >
> > > OK, you asked ;-)
> > >
> >
> > I certainly did. ^o^
> >
> > > This is all via RBD, I am running a single filesystem on top of 8 RBD
> > > devices in an effort to get data striping across more OSDs, I had been
> > > using that setup before adding the cache tier.
> > >
> > Nods.
> > Depending on your use case (sequential writes) actual RADOS striping
> might
> > be more advantageous than this (with 4MB writes still going to the same
> > PG/OSD all the time).
> >
> >
> > > 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> > > setup with replication size 3. No SSDs involved in those OSDs, since
> > > ceph-disk does not let you break a bluestore configuration into more
> > > than one device at the moment.
> > >
> > That's a pity, but supposedly just  a limitation of ceph-disk.
> > I'd venture you can work around that with symlinks to a raw SSD
> > partition, same as with current filestore journals.
> >
> > As Sage recently wrote:
> > ---
> > BlueStore can use as many as three devices: one for the WAL (journal,
> > though it can be much smaller than FileStores, e.g., 128MB), one for
> > metadata (e.g., an SSD partition), and one for data.
> > ---
> I believe he also mentioned the use of bcache and friends for the osd,
> maybe a way forward in this case?
> Regards
> Josef
> >
> > > The 600 Mbytes/sec is an approx sustained number for the data rate I can
> > > get going into this pool via RBD, that turns into 3 times that for raw
> > > data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have
> > > pushed it harder than that from time to time, but the OSD really wants
> > > to use fdatasync a lot and that tends to suck up a lot of the potential
> > > of a device, these disks will do 160 Mbytes/sec if you stream data to
> > > them.
> > >
> > > I just checked with rados bench to this set of 33 OSDs with a 3 replica
> > > pool, and 600 Mbytes/sec is what it will do from the same client host.
> > >
> > This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
> > journals on 4 nodes with a replica of 3.
> >
> > So BlueStore is indeed faster than than filestore.
> >
> > > All the networking is 40 GB ethernet, single port per host, generally I
> > > can push 2.2 Gbytes/sec in one direction between two hosts over a single
> > > tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> > > node. Short of going to RDMA that appears to be about the limit for
> > > these processors.
> > >
> > Yeah, didn't expect your network to be involved here bottleneck wise, but
> > a good data point to have nevertheless.
> >
> > > There are a grand total of 2 400 GB P3700s which are running a pool with
> > > a replication factor of 1, these are in 2 other nodes. Once I add in
> > > replication perf goes downhill. If I had more hardware I would be
> > > running more of these and using replication, but I am out of network
> > > cards right now.
> > >
> > Alright, so at 900MB/s you're pretty close to what one would expect from 2
> > of these: 1080MB/s*2/2(journal).
> >
> > How much downhill is that?
> >
> > I have a production cache tier with 2 nodes (replica 2 of course) and 4
> > 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the
> performance
> > is pretty much what I would expect.
> >
> > > So 5 nodes running OSDs, and a 6th node running the RBD client using the
> > > kernel implementation.
> > >
> > I assume there's are reason for use the kernel RBD client (which kernel?),
> > given that it tends to be behind the curve in terms of features and speed?
> >
> > > Complete set of commands for creating the cache tier, I pulled this from
> > > history, so the line in the middle was a failed command actually so
> > > sorry for the red herring.
> > >
> > >   982  ceph osd pool create nvme 512 512 replicated_nvme
> > >   983  ceph osd pool set nvme size 1
> > >   984  ceph osd tier add rbd nvme
> > >   985  ceph osd tier cache-mode  nvme writeback
> > >   986  ceph osd tier set-overlay rbd nvme
> > >   987  ceph osd pool set nvme  hit_set_type bloom
> > >   988  ceph osd pool set target_max_bytes 5000  <<-- typo here, so never mind
> > >   989  ceph osd pool set nvme target_max_bytes 5000
> > >   990  ceph osd pool set nvme target_max_objects 5

Re: [ceph-users] mds segfault on cephfs snapshot creation

2016-04-20 Thread Yan, Zheng
On Wed, Apr 20, 2016 at 12:12 PM, Brady Deetz  wrote:
> As soon as I create a snapshot on the root of my test cephfs deployment with
> a single file within the root, my mds server kernel panics. I understand
> that snapshots are not recommended. Is it beneficial to developers for me to
> leave my cluster in its present state and provide whatever debugging
> information they'd like? I'm not really looking for a solution to a mission
> critical issue as much as providing an opportunity for developers to pull
> stack traces, logs, etc from a system affected by some sort of bug in
> cephfs/mds. This happens every time I create a directory inside my .snap
> directory.

It's likely your kernel is too old for the kernel mount. Which version of
the kernel do you use?



>
> Let me know if I should blow my cluster away?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs does not seem to properly free up space

2016-04-20 Thread Simion Rad
Yes, we do use customized layout settings for most of our folders. 
We have some long running backup jobs which require high-throughput writes in 
order to finish in a reasonable amount of time.

From: Florent B 
Sent: Wednesday, April 20, 2016 11:07
To: Yan, Zheng; Simion Rad
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs does not seem to properly free up space

That seems to be the bug we have for years now with CephFS. We always
used customized layout.

On 04/20/2016 02:20 AM, Yan, Zheng wrote:
> have you ever used fancy layout?
>
> see http://tracker.ceph.com/issues/15050
>
>
> On Wed, Apr 20, 2016 at 3:17 AM, Simion Rad  wrote:
>> Mounting and unmounting doesn't change anything.
>> The used space reported by the df command is nearly the same as the values
>> returned by the ceph -s command.
>>
>> Example 1, df output:
>> ceph-fuse   334T  134T  200T  41% /cephfs
>>
>> Example 2, ceph -s output:
>>  health HEALTH_WARN
>> mds0: Many clients (22) failing to respond to cache pressure
>> noscrub,nodeep-scrub,sortbitwise flag(s) set
>>  monmap e1: 5 mons at 
>> {r730-12=10.103.213.12:6789/0,r730-4=10.103.213.4:6789/0,r730-5=
>> 10.103.213.5:6789/0,r730-8=10.103.213.8:6789/0,r730-9=10.103.213.9:6789/0}
>> election epoch 132, quorum 0,1,2,3,4 
>> r730-4,r730-5,r730-8,r730-9,r730-12
>>  mdsmap e14637: 1/1/1 up {0=ceph2-mds-2=up:active}
>>  osdmap e6549: 68 osds: 68 up, 68 in
>> flags noscrub,nodeep-scrub,sortbitwise
>>   pgmap v4394151: 896 pgs, 3 pools, 54569 GB data, 56582 kobjects
>> 133 TB used, 199 TB / 333 TB avail
>>  896 active+clean
>>   client io 47395 B/s rd, 1979 kB/s wr, 388 op/s
>>
>>
>> 
>> From: John Spray 
>> Sent: Tuesday, April 19, 2016 22:04
>> To: Simion Rad
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] cephfs does not seem to properly free up space
>>
>> On Tue, Apr 19, 2016 at 2:40 PM, Simion Rad  wrote:
>>> Hello,
>>>
>>>
>>> At my workplace we have a production cephfs cluster (334 TB on 60 OSDs)
>>> which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on
>>> Ubuntu 14.04.3 (linux 3.19.0-33).
>>>
>>> It seems that cephfs still doesn't free up space at all or at least that's
>>> what df command tells us.
>> Hmm, historically there were bugs with the purging code, but I thought
>> we fixed them before Infernalis.
>>
>> Does the space get freed after you unmount the client?  Some issues
>> have involved clients holding onto references to unlinked inodes.
>>
>> John
>>
>>> Is there a better way of getting a df-like output with other command for
>>> cephfs  ?
>>>
>>>
>>> Thank you,
>>>
>>> Marius Rad
>>>
>>> SysAdmin
>>>
>>> www.propertyshark.com
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multiple OSD crashing a lot

2016-04-20 Thread Blade Doyle
I get a lot of OSD crashes with the following stack trace - suggestions please:

 0> 1969-12-31 16:04:55.455688 83ccf410 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*,
unsigned int)' thread 83ccf410 time 295.324905
osd/ReplicatedPG.cc: 11011: FAILED assert(obc)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x3f9) [0xb6c625e6]
 2: (ReplicatedPG::hit_set_persist()+0x8bf) [0xb6c62fb4]
 3: (ReplicatedPG::do_op(std::tr1::shared_ptr)+0xc97)
[0xb6c6eb2c]
 4: (ReplicatedPG::do_request(std::tr1::shared_ptr,
ThreadPool::TPHandle&)+0x439) [0xb6c2f01a]
 5: (OSD::dequeue_op(boost::intrusive_ptr,
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x22b) [0xb6b0b984]
 6: (OSD::OpWQ::_process(boost::intrusive_ptr,
ThreadPool::TPHandle&)+0x13d) [0xb6b1ccf6]
 7: (ThreadPool::WorkQueueVal,
std::tr1::shared_ptr >, boost::intrusive_ptr
>::_void_process(void*, ThreadPool::TPHandle&)+0x6b) [0xb6b4692c]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb93) [0xb6e152bc]
 9: (ThreadPool::WorkThread::entry()+0x9) [0xb6e15aea]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.


Blade.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Build Raw Volume from Recovered RBD Objects

2016-04-20 Thread Wido den Hollander

> On 19 April 2016 at 19:15, Mike Dawson wrote:
> 
> 
> All,
> 
> I was called in to assist in a failed Ceph environment with the cluster 
> in an inoperable state. No rbd volumes are mountable/exportable due to 
> missing PGs.
> 
> The previous operator was using a replica count of 2. The cluster 
> suffered a power outage and various non-catastrophic hardware issues as 
> they were starting it back up. At some point during recovery, drives 
> were removed from the cluster leaving several PGs missing.
> 
> Efforts to restore the missing PGs from the data on the removed drives 
> failed using the process detailed in a Red Hat Customer Support blog 
> post [0]. Upon starting the OSDs with recovered PGs, a segfault halts 
> progress. The original operator isn't clear on when, but there may have 
> been a software upgrade applied after the drives were pulled.
> 
> I believe the cluster may be irrecoverable at this point.
> 

That's not good to hear!

> My recovery assistance has focused on a plan to:
> 
> 1) Scrape all objects for several key rbd volumes from live OSDs and the 
> removed former OSD drives.
> 
> 2) Compare and deduplicate the two copies of each object.
> 
> 3) Recombine the objects for each volume into a raw image.
> 
> I have completed steps 1 and 2 with apparent success. My initial stab at 
> step 3 yielded a raw image that could be mounted and had signs of a 
> filesystem, but it could not be read. Could anyone assist me with the 
> following questions?
> 
> 1) Are the rbd objects in order by filename? If not, what is the method 
> to determine their order?
> 

You might want to try my blogpost: 
http://blog.widodh.nl/2014/04/calculating-rados-objects-for-rbd-image/

> 2) How should objects smaller than the default 4MB chunk size be 
> handled? Should they be padded somehow?
> 

Yes, with zeroes. But it depends on the offset. I don't know that for sure.

> 3) If any objects were completely missing and therefore unavailable to 
> this process, how should they be handled? I assume we need to offset/pad 
> to compensate.

If they are missing just add 4MB of zeroes.

You might want to try importing the RBD objects into a fresh RBD cluster using 
the RADOS API.

Just make sure you have an RBD header object with the proper object prefix and
size in there. Through librbd you then might be able to recover the data.

This way you use the RBD logic instead of having to script it yourself.
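
For completeness, the script-it-yourself variant can be as small as the
sketch below. Everything in it is an assumption for illustration: format-1
object naming, 4MB object size, and the prefix and image size are hypothetical
values that would come from the recovered rbd header / 'rbd info' output.

#!/bin/bash
# Reassemble recovered RBD data objects into a sparse raw image.
PREFIX="rb.0.1234.567890ab"                # hypothetical block name prefix
OUT="recovered.raw"
OBJ_SIZE=$((4 * 1024 * 1024))              # 4MB objects
IMAGE_SIZE=$((100 * 1024 * 1024 * 1024))   # hypothetical 100GB volume size

for f in "${PREFIX}".*; do
    idx=$((16#${f##*.}))                   # hex object index -> decimal
    # Write each object at index*4MB. Short objects leave the rest of their
    # chunk as a hole and missing objects are never written at all, so both
    # read back as zeroes - which matches the padding advice above.
    dd if="$f" of="$OUT" bs=$OBJ_SIZE seek=$idx conv=notrunc 2>/dev/null
done
truncate -s $IMAGE_SIZE "$OUT"             # pad the tail out to the full size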

Good luck!

Wido

> -- 
> Thanks,
> 
> Mike Dawson
> Co-Founder & Director of Cloud Architecture
> Cloudapt LLC
> 6330 East 75th Street, Suite 170
> Indianapolis, IN 46250
> M: 317-490-3018
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com