Re: [ceph-users] CephFS Bug found with CentOS 7.2

2016-06-16 Thread Oliver Dzombic
Hi,

just to verify this:

no symlink usage == no problem/bug

right ?

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Address:

IP Interactive UG (haftungsbeschraenkt)
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, registered at the Amtsgericht (district court) Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 17.06.2016 at 06:11, Yan, Zheng wrote:
> On Fri, Jun 17, 2016 at 5:03 AM, Jason Gress  wrote:
>> This is the latest default kernel with CentOS7.  We also tried a newer
>> kernel (from elrepo), a 4.4 that has the same problem, so I don't think
>> that is it.  Thank you for the suggestion though.
>>
>> We upgraded our cluster to the 10.2.2 release today, and it didn't resolve
>> all of the issues.  It's possible that a related issue is actually
>> permissions.  Something may not be right with our config (or a bug) here.
>>
>> While testing we noticed that there may actually be two issues here.  I am
>> unsure, as we noticed that the most consistent way to reproduce our issue
>> is to use vim or sed -i which does in place renames:
>>
>> [root@ftp01 cron]# ls -la
>> total 3
>> drwx--   1 root root 2044 Jun 16 15:50 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-r--r--   1 root root  300 Jun 16 15:50 file
>> -rw---   1 root root 2044 Jun 16 13:47 root
>> [root@ftp01 cron]# sed -i 's/^/#/' file
>> sed: cannot rename ./sedfB2CkO: Permission denied
>>
>>
>> Strangely, adding or deleting files works fine, it's only renaming that
>> fails.  And strangely I was able to successfully edit the file on ftp02:
>>
>> [root@ftp02 cron]# sed -i 's/^/#/' file
>> [root@ftp02 cron]# ls -la
>> total 3
>> drwx--   1 root root 2044 Jun 16 15:49 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-r--r--   1 root root  313 Jun 16 15:49 file
>> -rw---   1 root root 2044 Jun 16 13:47 root
>>
>>
>> Then it worked on ftp01 this time:
>> [root@ftp01 cron]# ls -la
>> total 3
>> drwx--   1 root root 2357 Jun 16 15:49 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-r--r--   1 root root  313 Jun 16 15:49 file
>> -rw---   1 root root 2044 Jun 16 13:47 root
>>
>>
>> Then, I vim'd it successfully on ftp01... Then ran the sed again:
>>
>> [root@ftp01 cron]# sed -i 's/^/#/' file
>> sed: cannot rename ./sedfB2CkO: Permission denied
>> [root@ftp01 cron]# ls -la
>> total 3
>> drwx--   1 root root 2044 Jun 16 15:51 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-r--r--   1 root root  300 Jun 16 15:50 file
>> -rw---   1 root root 2044 Jun 16 13:47 root
>>
>>
>> And now we have the zero file problem again:
>>
>> [root@ftp02 cron]# ls -la
>> total 2
>> drwx--   1 root root 2044 Jun 16 15:51 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw-r--r--   1 root root0 Jun 16 15:50 file
>> -rw---   1 root root 2044 Jun 16 13:47 root
>>
>>
>> Anyway, I wonder how much of this issue is related to that cannot rename
>> issue above.  Here are our security settings:
>>
>> client.ftp01
>> key: 
>> caps: [mds] allow r, allow rw path=/ftp
>> caps: [mon] allow r
>> caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
>> client.ftp02
>> key: 
>> caps: [mds] allow r, allow rw path=/ftp
>> caps: [mon] allow r
>> caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
>>
>>
>> /ftp is the directory on cephfs under which cron lives; the full path is
>> /ftp/cron .
>>
>> I hope this helps and thank you for your time!
> 
> I opened  ticket http://tracker.ceph.com/issues/16358. The bug is in
> path restriction code. For now, the workaround is updating client caps
> to not use path restriction.
> 
> Regards
> Yan, Zheng
> 
>>
>> Jason
>>
>> On 6/15/16, 4:43 PM, "John Spray"  wrote:
>>
>>> On Wed, Jun 15, 2016 at 10:21 PM, Jason Gress 
>>> wrote:
 While trying to use CephFS as a clustered filesystem, we stumbled upon a
 reproducible bug that is unfortunately pretty serious, as it leads to
 data
 loss.  Here is the situation:

 We have two systems, named ftp01 and ftp02.  They are both running
 CentOS
 7.2, with this kernel release and ceph packages:

 kernel-3.10.0-327.18.2.el7.x86_64
>>>
>>> That is an old-ish kernel to be using with cephfs.  It may well be the
>>> source of your issues.
>>>
 [root@ftp01 cron]# rpm -qa | grep ceph
 ceph-base-10.2.1-0.el7.x86_64
 ceph-deploy-1.5.33-0.noarch
 ceph-mon-10.2.1-0.el7.x86_64
 libcephfs1-10.2.1-0.el7.x86_64
 ceph-selinux-10.2.1-0.el7.x86_64
 ceph-mds-10.2.1-0.el7.x86_64
 ceph-common-10.2.1-0.el7.x86_64
 ceph-10.2.1-0.el7.x86_64
 python-cephfs-10.2.1-0.el7.x86_64
 ceph-osd-10.2.1-0.el7.x86_64

 Mounted like so:
 XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph
 _netdev,relatime,name=ftp01,secretfile=/etc/ceph/ftp01.secret 0 0
 And:
 XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph

Re: [ceph-users] Bluestore RAM usage/utilization

2016-06-16 Thread Christian Balzer

Hello Adam,

On Thu, 16 Jun 2016 23:40:26 -0500 Adam Tygart wrote:

> According to Sage[1] Bluestore makes use of the pagecache. I don't
> believe read-ahead is a filesystem tunable in Linux, it is set on the
> block device itself, therefore read-ahead shouldn't be an issue.
> 
Thanks for that link; that's very welcome news.
So all that RAM is not going to waste. The equivalent of dir-entries and
inodes, I guess, lives in RocksDB, so being able to let that grow accordingly
in RAM would be a good thing, too.

As for read-ahead, take a peek at these:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg27674.html
http://www.spinics.net/lists/ceph-devel/msg30010.html

The "We are more dependent on client-side readahead with bluestore since
there is no underlying filesystem below the OSDs helping us out." bit is
what was stuck in my head.
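For reference, a minimal sketch of what client-side readahead tuning looks
like for librbd clients, via the [client] section of ceph.conf (the values are
illustrative only, not recommendations; kernel-mounted clients instead use the
block device's own readahead setting, see Adam's point below):

[client]
# start reading ahead after 10 sequential requests, read up to 4 MB per
# readahead, and stop once 50 MB of the image have been read (example values)
rbd readahead trigger requests = 10
rbd readahead max bytes = 4194304
rbd readahead disable after bytes = 52428800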

Thanks again,

Christian
> I'm not familiar enough with Bluestore to comment on the rest.
> 
> [1] http://www.spinics.net/lists/ceph-devel/msg29398.html
> 
> --
> Adam
> 
> On Thu, Jun 16, 2016 at 11:09 PM, Christian Balzer  wrote:
> >
> > Hello,
> >
> > I don't have anything running Jewel yet, so this is for devs and people
> > who have played with bluestore or read the code.
> >
> > With filestore, Ceph benefits from ample RAM, both in terms of
> > pagecache for reads of hot objects and SLAB to keep all the
> > dir-entries and inodes in memory.
> >
> > With bluestore not being a FS, I'm wondering what can and will be done
> > for it to maximize performance by using available RAM.
> > I doubt there's a dynamic cache allocation ala pagecache present or on
> > the road-map.
> > But how about parameters to grow caches (are there any?) and give the
> > DB more breathing space?
> >
> > I suppose this also cuts into the current inability to do read-ahead
> > with bluestore by itself (not client driven).
> >
> > The underlying reason for this of course to future proof OSD storage
> > servers, any journal SSDs will be beneficial for RocksDB and WAL as
> > well, but if available memory can't be utilized beyond what the OSDs
> > need themselves it makes little sense to put extra RAM into them.
> >
> > Christian
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Bluestore RAM usage/utilization

2016-06-16 Thread Adam Tygart
According to Sage[1] Bluestore makes use of the pagecache. I don't
believe read-ahead is a filesystem tunable in Linux, it is set on the
block device itself, therefore read-ahead shouldn't be an issue.
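For what it's worth, this is the knob in question on the client side (the
device name below is just an example for an rbd-mapped image):

# readahead is reported/set in 512-byte sectors
blockdev --getra /dev/rbd0
blockdev --setra 8192 /dev/rbd0      # 8192 sectors = 4 MB
# the same setting via sysfs, in KB
cat /sys/block/rbd0/queue/read_ahead_kb
echo 4096 > /sys/block/rbd0/queue/read_ahead_kb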

I'm not familiar enough with Bluestore to comment on the rest.

[1] http://www.spinics.net/lists/ceph-devel/msg29398.html

--
Adam

On Thu, Jun 16, 2016 at 11:09 PM, Christian Balzer  wrote:
>
> Hello,
>
> I don't have anything running Jewel yet, so this is for devs and people
> who have played with bluestore or read the code.
>
> With filestore, Ceph benefits from ample RAM, both in terms of pagecache
> for reads of hot objects and SLAB to keep all the dir-entries and inodes
> in memory.
>
> With bluestore not being a FS, I'm wondering what can and will be done for
> it to maximize performance by using available RAM.
> I doubt there's a dynamic cache allocation ala pagecache present or on the
> road-map.
> But how about parameters to grow caches (are there any?) and give the DB
> more breathing space?
>
> I suppose this also cuts into the current inability to do read-ahead with
> bluestore by itself (not client driven).
>
> The underlying reason for this of course to future proof OSD storage
> servers, any journal SSDs will be beneficial for RocksDB and WAL as well,
> but if available memory can't be utilized beyond what the OSDs need
> themselves it makes little sense to put extra RAM into them.
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Bug found with CentOS 7.2

2016-06-16 Thread Yan, Zheng
On Fri, Jun 17, 2016 at 5:03 AM, Jason Gress  wrote:
> This is the latest default kernel with CentOS7.  We also tried a newer
> kernel (from elrepo), a 4.4 that has the same problem, so I don't think
> that is it.  Thank you for the suggestion though.
>
> We upgraded our cluster to the 10.2.2 release today, and it didn't resolve
> all of the issues.  It's possible that a related issue is actually
> permissions.  Something may not be right with our config (or a bug) here.
>
> While testing we noticed that there may actually be two issues here.  I am
> unsure, as we noticed that the most consistent way to reproduce our issue
> is to use vim or sed -i which does in place renames:
>
> [root@ftp01 cron]# ls -la
> total 3
> drwx--   1 root root 2044 Jun 16 15:50 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root  300 Jun 16 15:50 file
> -rw---   1 root root 2044 Jun 16 13:47 root
> [root@ftp01 cron]# sed -i 's/^/#/' file
> sed: cannot rename ./sedfB2CkO: Permission denied
>
>
> Strangely, adding or deleting files works fine, it's only renaming that
> fails.  And strangely I was able to successfully edit the file on ftp02:
>
> [root@ftp02 cron]# sed -i 's/^/#/' file
> [root@ftp02 cron]# ls -la
> total 3
> drwx--   1 root root 2044 Jun 16 15:49 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root  313 Jun 16 15:49 file
> -rw---   1 root root 2044 Jun 16 13:47 root
>
>
> Then it worked on ftp01 this time:
> [root@ftp01 cron]# ls -la
> total 3
> drwx--   1 root root 2357 Jun 16 15:49 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root  313 Jun 16 15:49 file
> -rw---   1 root root 2044 Jun 16 13:47 root
>
>
> Then, I vim'd it successfully on ftp01... Then ran the sed again:
>
> [root@ftp01 cron]# sed -i 's/^/#/' file
> sed: cannot rename ./sedfB2CkO: Permission denied
> [root@ftp01 cron]# ls -la
> total 3
> drwx--   1 root root 2044 Jun 16 15:51 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root  300 Jun 16 15:50 file
> -rw---   1 root root 2044 Jun 16 13:47 root
>
>
> And now we have the zero file problem again:
>
> [root@ftp02 cron]# ls -la
> total 2
> drwx--   1 root root 2044 Jun 16 15:51 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root0 Jun 16 15:50 file
> -rw---   1 root root 2044 Jun 16 13:47 root
>
>
> Anyway, I wonder how much of this issue is related to that cannot rename
> issue above.  Here are our security settings:
>
> client.ftp01
> key: 
> caps: [mds] allow r, allow rw path=/ftp
> caps: [mon] allow r
> caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
> client.ftp02
> key: 
> caps: [mds] allow r, allow rw path=/ftp
> caps: [mon] allow r
> caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
>
>
> /ftp is the directory on cephfs under which cron lives; the full path is
> /ftp/cron .
>
> I hope this helps and thank you for your time!

I opened a ticket: http://tracker.ceph.com/issues/16358. The bug is in the
path restriction code. For now, the workaround is updating the client caps
to not use a path restriction.
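
A hedged sketch of that workaround, using the client and pool names from the
report above (double-check your own caps with 'ceph auth list' before
changing anything):

# replace the path-restricted mds cap with an unrestricted one
ceph auth caps client.ftp01 \
    mds 'allow rw' \
    mon 'allow r' \
    osd 'allow rw pool=cephfs_metadata, allow rw pool=cephfs_data'
# verify the result
ceph auth get client.ftp01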

Regards
Yan, Zheng

>
> Jason
>
> On 6/15/16, 4:43 PM, "John Spray"  wrote:
>
>>On Wed, Jun 15, 2016 at 10:21 PM, Jason Gress 
>>wrote:
>>> While trying to use CephFS as a clustered filesystem, we stumbled upon a
>>> reproducible bug that is unfortunately pretty serious, as it leads to
>>>data
>>> loss.  Here is the situation:
>>>
>>> We have two systems, named ftp01 and ftp02.  They are both running
>>>CentOS
>>> 7.2, with this kernel release and ceph packages:
>>>
>>> kernel-3.10.0-327.18.2.el7.x86_64
>>
>>That is an old-ish kernel to be using with cephfs.  It may well be the
>>source of your issues.
>>
>>> [root@ftp01 cron]# rpm -qa | grep ceph
>>> ceph-base-10.2.1-0.el7.x86_64
>>> ceph-deploy-1.5.33-0.noarch
>>> ceph-mon-10.2.1-0.el7.x86_64
>>> libcephfs1-10.2.1-0.el7.x86_64
>>> ceph-selinux-10.2.1-0.el7.x86_64
>>> ceph-mds-10.2.1-0.el7.x86_64
>>> ceph-common-10.2.1-0.el7.x86_64
>>> ceph-10.2.1-0.el7.x86_64
>>> python-cephfs-10.2.1-0.el7.x86_64
>>> ceph-osd-10.2.1-0.el7.x86_64
>>>
>>> Mounted like so:
>>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph
>>> _netdev,relatime,name=ftp01,secretfile=/etc/ceph/ftp01.secret 0 0
>>> And:
>>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph
>>> _netdev,relatime,name=ftp02,secretfile=/etc/ceph/ftp02.secret 0 0
>>>
>>> This filesystem has 234GB worth of data on it, and I created another
>>> subdirectory and mounted it, NFS style.
>>>
>>> Here were the steps to reproduce:
>>>
>>> First, I created a file (I was mounting /var/spool/cron on two systems)
>>>on
>>> ftp01:
>>> (crond is not running right now on either system to keep the variables
>>>down)
>>>
>>> [root@ftp01 cron]# cp /tmp/root .
>>>
>>> Shows up on both fine:
>>> [root@ftp01 cron]# ls -la
>>> total 2
>>> drwx--   1 root root0 Jun 15 15:50

[ceph-users] Bluestore RAM usage/utilization

2016-06-16 Thread Christian Balzer

Hello,

I don't have anything running Jewel yet, so this is for devs and people
who have played with bluestore or read the code.

With filestore, Ceph benefits from ample RAM, both in terms of pagecache
for reads of hot objects and SLAB to keep all the dir-entries and inodes
in memory.

With bluestore not being a FS, I'm wondering what can and will be done for
it to maximize performance by using available RAM.
I doubt there's a dynamic cache allocation ala pagecache present or on the
road-map. 
But how about parameters to grow caches (are there any?) and give the DB
more breathing space?

I suppose this also cuts into the current inability to do read-ahead with
bluestore by itself (not client driven).

The underlying reason for this is, of course, to future-proof OSD storage
servers: any journal SSDs will be beneficial for RocksDB and the WAL as well,
but if available memory can't be utilized beyond what the OSDs themselves
need, it makes little sense to put extra RAM into them.

Christian
-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


[ceph-users] Mysterious cache-tier flushing behavior

2016-06-16 Thread Christian Balzer

Hello devs and other sage(sic) people,

Ceph 0.94.5, cache tier in writeback mode.

As mentioned before, I'm running a cron job every day at 23:40 dropping
the flush dirty target by 4% (0.60 to 0.56) and then re-setting it to the
previous value 10 minutes later.
The idea is to have all the flushing done during off-peak hours and that
works beautifully.
No flushes during day time, only lightweight evicts.
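
For context, that cron job boils down to something along these lines (the
pool name is a placeholder; the ratios are the 0.60/0.56 mentioned above):

# /etc/cron.d/ceph-cache-flush -- illustrative only
# 23:40: lower the dirty target so flushing happens off-peak
40 23 * * * root ceph osd pool set cache-pool cache_target_dirty_ratio 0.56
# 23:50: restore the normal target
50 23 * * * root ceph osd pool set cache-pool cache_target_dirty_ratio 0.60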

Now I'm graphing all kinds of Ceph and system related info with graphite
and noticed something odd.

When the flushes are initiated, the HDD space of the OSDs in the backing
store drops by a few GB, pretty much the amount of dirty objects over the
threshold accumulated during a day, so no surprise there. 
This happens every time when that cron job runs.

However only on some days this drop (more pronounced on those days) is
accompanied by actual:
a) flushes according to the respective Ceph counters
b) network traffic from the cache-tier to the backing OSDs
c) HDD OSD writes (both from OSD perspective and actual HDD)
d) cache pool SSD reads (both from OSD perspective and actual SSD)

So what is happening on the other days?

The space clearly is gone and triggered by the "flush", but no data was
actually transferred to the HDD OSD nodes, nor was there anything (newly)
written.

Dazed and confused,

Christian
-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] CephFS Bug found with CentOS 7.2

2016-06-16 Thread Yan, Zheng
On Fri, Jun 17, 2016 at 5:18 AM, Adam Tygart  wrote:
> This sounds an awful lot like a a bug I've run into a few times (not
> often enough to get a good backtrace out of the kernel or mds)
> involving vim on a symlink to a file in another directory. It will
> occasionally corrupt the symlink in such a way that the symlink is
> unreadable. Filling dmesg with:
>
> [ 2368.036667] ceph: fill_inode badness on 8800bb5fb610
> [ 2368.969657] [ cut here ]
> [ 2368.969670] WARNING: CPU: 0 PID: 15 at fs/ceph/inode.c:813
> fill_inode.isra.19+0x4b1/0xa49()
> [ 2368.969672] Modules linked in:
> [ 2368.969684] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: GW
>4.5.0-gentoo #1
> [ 2368.969686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [ 2368.969693] Workqueue: ceph-msgr ceph_con_workfn
> [ 2368.969695]  0286 7000a7b9 88017e267af0
> b142ec39
> [ 2368.969698]   0009 88017e267b28
> b1091c83
> [ 2368.969700]  b13be512 c900020da8cd 880427a30230
> 
> [ 2368.969704] Call Trace:
> [ 2368.969709]  [] dump_stack+0x63/0x7f
> [ 2368.969714]  [] warn_slowpath_common+0x9a/0xb3
> [ 2368.969717]  [] ? fill_inode.isra.19+0x4b1/0xa49
> [ 2368.969719]  [] warn_slowpath_null+0x15/0x17
> [ 2368.969722]  [] fill_inode.isra.19+0x4b1/0xa49
> [ 2368.969724]  [] ? ceph_mount+0x729/0x72e
> [ 2368.969727]  [] ceph_readdir_prepopulate+0x48f/0x70c
> [ 2368.969730]  [] dispatch+0xebf/0x1428
> [ 2368.969752]  [] ? 
> ceph_x_check_message_signature+0x42/0xc4
> [ 2368.969756]  [] ceph_con_workfn+0xe1a/0x24f3
> [ 2368.969759]  [] ? load_TLS+0xb/0xf
> [ 2368.969761]  [] ? __switch_to+0x3b0/0x42b
> [ 2368.969765]  [] ? finish_task_switch+0xff/0x191
> [ 2368.969768]  [] process_one_work+0x175/0x2a0
> [ 2368.969770]  [] worker_thread+0x1fc/0x2ae
> [ 2368.969772]  [] ? rescuer_thread+0x2c0/0x2c0
> [ 2368.969775]  [] kthread+0xaf/0xb7
> [ 2368.969777]  [] ? kthread_parkme+0x1f/0x1f
> [ 2368.969780]  [] ret_from_fork+0x3f/0x70
> [ 2368.969782]  [] ? kthread_parkme+0x1f/0x1f
> [ 2368.969784] ---[ end trace b054c5c6854fd2ab ]---
> [ 2368.969786] ceph: fill_inode badness on 880428185d70
> [ 2370.289733] [ cut here ]
> [ 2370.289747] WARNING: CPU: 0 PID: 15 at fs/ceph/inode.c:813
> fill_inode.isra.19+0x4b1/0xa49()
> [ 2370.289750] Modules linked in:
> [ 2370.289756] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: GW
>4.5.0-gentoo #1
> [ 2370.289759] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [ 2370.289767] Workqueue: ceph-msgr ceph_con_workfn
> [ 2370.289769]  0286 7000a7b9 88017e267af0
> b142ec39
> [ 2370.289774]   0009 88017e267b28
> b1091c83
> [ 2370.289777]  b13be512 c900020f58cd 880427a30230
> 
> [ 2370.289781] Call Trace:
> [ 2370.289787]  [] dump_stack+0x63/0x7f
> [ 2370.289793]  [] warn_slowpath_common+0x9a/0xb3
> [ 2370.289797]  [] ? fill_inode.isra.19+0x4b1/0xa49
> [ 2370.289801]  [] warn_slowpath_null+0x15/0x17
> [ 2370.289804]  [] fill_inode.isra.19+0x4b1/0xa49
> [ 2370.289807]  [] ? ceph_mount+0x729/0x72e
> [ 2370.289811]  [] ceph_readdir_prepopulate+0x48f/0x70c
> [ 2370.289815]  [] dispatch+0xebf/0x1428
> [ 2370.289821]  [] ? 
> ceph_x_check_message_signature+0x42/0xc4
> [ 2370.289824]  [] ceph_con_workfn+0xe1a/0x24f3
> [ 2370.289829]  [] ? load_TLS+0xb/0xf
> [ 2370.289832]  [] ? __switch_to+0x3b0/0x42b
> [ 2370.289837]  [] ? finish_task_switch+0xff/0x191
> [ 2370.289841]  [] process_one_work+0x175/0x2a0
> [ 2370.289843]  [] worker_thread+0x1fc/0x2ae
> [ 2370.289846]  [] ? rescuer_thread+0x2c0/0x2c0
> [ 2370.289849]  [] kthread+0xaf/0xb7
> [ 2370.289853]  [] ? kthread_parkme+0x1f/0x1f
> [ 2370.289857]  [] ret_from_fork+0x3f/0x70
> [ 2370.289860]  [] ? kthread_parkme+0x1f/0x1f
> [ 2370.289863] ---[ end trace b054c5c6854fd2ac ]---
> [ 2370.289865] ceph: fill_inode badness on 880428185d70
> [ 2371.525649] [ cut here ]
> [ 2371.525663] WARNING: CPU: 0 PID: 15 at fs/ceph/inode.c:813
> fill_inode.isra.19+0x4b1/0xa49()
> [ 2371.525665] Modules linked in:
> [ 2371.525670] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: GW
>4.5.0-gentoo #1
> [ 2371.525672] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [ 2371.525679] Workqueue: ceph-msgr ceph_con_workfn
> [ 2371.525682]  0286 7000a7b9 88017e267af0
> b142ec39
> [ 2371.525685]   0009 88017e267b28
> b1091c83
> [ 2371.525687]  b13be512 c900021108cd 880427a30230
> 
> [ 2371.525690] Call Trace:
> [ 2371.525696]  [] dump_stack+0x63/0x7f
> [ 2371.525701]  [] warn_slowpath_common+0x9a/0xb3
> [ 2371.525

Re: [ceph-users] CEPH with NVMe SSDs and Caching vs Journaling on SSDs

2016-06-16 Thread Christian Balzer

Hello,

On Thu, 16 Jun 2016 15:31:13 + Tim Gipson wrote:

> A few questions.
> 
> First, is there a good step by step to setting up a caching tier with
> NVMe SSDs that are on separate hosts?  Is that even possible?
> 
Yes. And with a cluster of your size that's the way I'd do it.
Larger clusters (a dozen plus nodes) are likely to be better suited with
storage nodes that have shared HDD OSDs for slow storage and SSD OSDs for
cache pools.

It would behoove you to scour this ML for the dozens of threads covering
this and other aspects, like:
"journal or cache tier on SSDs ?"
"Steps for Adding Cache Tier"
and even yesterday's:
"Is Dynamic Cache tiering supported in Jewel"

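For orientation, the basic shape of a writeback cache tier on a separate
SSD/NVMe pool is roughly the following (pool names and sizing values are
placeholders, not recommendations):

# put an existing SSD pool in front of the HDD pool as a writeback cache
ceph osd tier add hdd-pool ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay hdd-pool ssd-cache
# minimal sizing/flush knobs, example values only
ceph osd pool set ssd-cache hit_set_type bloom
ceph osd pool set ssd-cache target_max_bytes 500000000000
ceph osd pool set ssd-cache cache_target_dirty_ratio 0.4
ceph osd pool set ssd-cache cache_target_full_ratio 0.8
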
> Second, what sort of performance are people seeing from caching
> tiers/journaling on SSDs in Jewel?
> 
Not using Jewel, but it's bound to be better than Hammer.

Performance will depend on a myriad of things, including CPU, SSD/NVMe
models, networking, tuning, etc.
It would be better if you had a performance target and a budget to see if
they can be matched up.

Cache tiering and journaling are very different things, don't mix them up.

> Right now I am working on trying to find best practice for a CEPH
> cluster with 3 monitor nodes, and 3 OSDs with 1 800GB NVMe drive and 12
> 6TB drives.
> 
No need for dedicated monitor nodes (definitely not 3 with a cluster of
that size) if your storage nodes are designed correctly; see for example:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/008879.html

> My goal is reliable/somewhat fast performance.
>
Well, for starters this cluster will give you the space of one of these
nodes and worse performance than a single node due to the 3x replication.

What NVMe did you have in mind, a DC P3600 will give you 1GB/s writes
(and 3DWPD endurance), a P3700 2GB/s (and 10DWPD endurance).

What about your network?

Since the default failure domain in Ceph is the host, a single NVMe as
journal for all HDD OSDs isn't particularly risky, but it's something to
keep in mind.
 
Christian
> Any help would be greatly appreciated!
> 
> Tim Gipson
> Systems Engineer
> 
> [http://www.ena.com/signature/enaemaillogo.gif]
> 
> 
> 618 Grassmere Park Drive, Suite 12
> Nashville, TN 37211
> 
> 
> 
> website | blog |
> support
> 
> 
> [http://www.ena.com/signature/facebook.png]
> [http://www.ena.com/signature/twitter.png]
> 
> [http://www.ena.com/signature/linkedin.png]
> 
> [http://www.ena.com/signature/youtube.png]
> 
> 
> 
> 
> 


-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] rbd ioengine for fio

2016-06-16 Thread Mavis Xiang
this really did the trick! Thank you guys!

Best,
Mavis

On Thu, Jun 16, 2016 at 8:18 PM, Jason Dillaman  wrote:

> On Thu, Jun 16, 2016 at 8:14 PM, Mavis Xiang  wrote:
> > clientname=client.admin
>
> Try "clientname=admin" -- I think it's treating the client "name" as
> the "id", so specifying "client.admin" is really treated as
> "client.client.admin".
>
> --
> Jason
>


Re: [ceph-users] ceph benchmark

2016-06-16 Thread Christian Balzer

Hello,

as others already stated, your numbers don't add up or make sense.

More below.

On Thu, 16 Jun 2016 16:53:10 -0400 Patrick McGarry wrote:

> Moving this over to ceph-user where it’ll get the eyeballs you need.
> 
> On Mon, Jun 13, 2016 at 2:58 AM, Marcus Strasser
>  wrote:
> > Hello!
> >
> >
> >
> > I have a little test cluster with 2 server. Each Server have an osd
> > with 800 GB, there is a 10 Gbps Link between the servers.
> >
What kind of OSDs are these?
The size suggests SSDs/NVMes, but without this information a huge piece of
the puzzle is missing. 
Exact models please.

Since you have 2 nodes, I presume you changed replication from 3 to 2.

This will give you better results, but you want to use 3 in real-life, so
your test results will be flawed, keep that in mind.

> > On a ceph-client i have configured a cephfs, mount kernelspace. The
> > client is also connected with a 10 Gbps Link.
> >
With a kernelspace mount and without specifying direct writes in dd, most of
the 64GB of RAM in your client will be used as pagecache.

> > All 3 use debian
> >
> > 4.5.5 kernel
> >
> > 64 GB mem
> >
> > There is no special configuration.
> >
> >
> >
> > Now the question:
> >
> > When i use the dd (~11GB) command in the cephfs mount, i get a result
> > of 3 GB/s
> >
3GB/s is already well above what the network can deliver, so you're seeing
caching as noted above.
The best sustainable speed you'd be able to achieve in your setup would be
about 1GB/s, and even that is overly simplistic and optimistic.

Also at this time I'd like to add my usual comment that in over 90% of all
use cases speed as in bandwidth is a distant second to the much more
important speed in terms of IOPS.

> >
> >
> > dd if=/dev/zero of=/cephtest/test bs=1M count=10240
> >
> >
> >
> > Is it possble to transfer the data faster (use full capacity oft he
> > network) and cache it with the memory?
> >
Again, according to your numbers and description that's already happening.

Note that RAM on the storage servers will NOT help with write speeds; it
will be helpful for reads, and a large SLAB can prevent unnecessary disk
accesses.

Christian
> >
> >
> > Thanks,
> >
> > Marcus Strasser
> >
> >
> >
> >
> >
> > Marcus Strasser
> >
> > Linux Systeme
> >
> > Russmedia IT GmbH
> >
> > A-6850 Schwarzach, Gutenbergstr. 1
> >
> >
> >
> > T +43 5572 501-872
> >
> > F +43 5572 501-97872
> >
> > marcus.stras...@highspeed.vol.at
> >
> > highspeed.vol.at
> >
> >
> >
> >
> 
> 
> 


-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] rbd ioengine for fio

2016-06-16 Thread Jason Dillaman
On Thu, Jun 16, 2016 at 8:14 PM, Mavis Xiang  wrote:
> clientname=client.admin

Try "clientname=admin" -- I think it's treating the client "name" as
the "id", so specifying "client.admin" is really treated as
"client.client.admin".

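In other words, the job file's [global] section would look roughly like this
(pool/image names taken from the earlier mail; only clientname changes):

[global]
ioengine=rbd
# "admin", not "client.admin" -- the "client." prefix is added automatically
clientname=admin
pool=ecssdcache
rbdname=imagecacherbd
invalidate=0
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32
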
-- 
Jason


Re: [ceph-users] rbd ioengine for fio

2016-06-16 Thread Mavis Xiang
Hi Somnath,

Thank you for your reply!!
The if script is:

[global]

#logging

#write_iops_log=write_iops_log

#write_bw_log=write_bw_log

#write_lat_log=write_lat_log

ioengine=rbd

clientname=client.admin

pool=ecssdcache

rbdname=imagecacherbd

invalidate=0

rw=randwrite

bs=4k


[rbd_iodepth32]

iodepth=32

The pool and rbd image names are correct.

Ceph -s from the rbd client and the monitor server both show:

    cluster e414604c-29d7-4adb-a889-7f70fc252dfa
     health HEALTH_WARN clock skew detected on mon.h02, mon.h05
     monmap e3: 3 mons at {h02=130.4.240.102:6789/0,h05=130.4.240.105:6789/0,h08=130.4.240.78:6789/0},
            election epoch 3212, quorum 0,1,2 h08,h02,h05
     osdmap e23689: 39 osds: 35 up, 35 in
      pgmap v3174229: 16126 pgs, 8 pools, 132 GB data, 198 kobjects
            545 GB used, 29224 GB / 29769 GB avail
                16126 active+clean

I've also checked that connection to the monitor hosts from the rbd client
looks good too.


Really not sure what's going on..


Thanks in advance all!


Best,

Mavis


On Thu, Jun 16, 2016 at 4:52 PM, Somnath Roy 
wrote:

> What is your fio script ?
>
>
>
> Make sure you do this..
>
>
>
> 1. Run say ‘ceph-s’ from  the server you are trying to connect and see if
> it is connecting properly or not. If so, you don’t have any keyring issues.
>
>
>
> 2. Now, make sure you have given the following param value properly based
> on your setup.
>
>
>
> pool=
>
> rbdname=
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Mavis Xiang
> *Sent:* Thursday, June 16, 2016 1:47 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] rbd ioengine for fio
>
>
>
> Hi all,
>
> I am new to the rbd engine for fio, and ran into the following problems
> when i try to run a 4k write with my rbd image:
>
>
>
>
>
> rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
> iodepth=32
>
> fio-2.11-17-ga275
>
> Starting 1 process
>
> rbd engine: RBD version: 0.1.8
>
> rados_connect failed.
>
> fio_rbd_connect failed.
>
>
>
> It seems that the rbd client cannot connect to the ceph cluster.
>
> Ceph health output:
>
> cluster e414604c-29d7-4adb-a889-7f70fc252dfa
>
>  health HEALTH_WARN clock skew detected on mon.h02, mon.h05
>
>
>
> But it should not affected the connection to the cluster.
>
> Ceph.conf:
>
> [global]
>
> fsid = e414604c-29d7-4adb-a889-7f70fc252dfa
>
> mon_initial_members = h02
>
> mon_host = XXX.X.XXX.XXX
>
> auth_cluster_required = cephx
>
> auth_service_required = cephx
>
> auth_client_required = cephx
>
> filestore_xattr_use_omap = true
>
> osd_pool_default_pg_num = 2400
>
> osd_pool_default_pgp_num = 2400
>
> public_network = XXX.X.XXX.X/21
>
>
>
> [osd]
>
> osd_crush_update_on_start = false
>
>
>
>
>
>
>
> Should this be something about keyring? i did not find any options about
> keyring that can be set in fio file.
>
> Can anyone please give some insights about this problem?
>
> Any help would be appreciated!
>
>
>
> Thanks!
>
>
>
> Yu
>
>
>
>
>


Re: [ceph-users] ceph benchmark

2016-06-16 Thread Karan Singh
Agree with David

It's being cached; you can try, for example:
- oflag options for dd (see the sketch below)
- monitoring the system cache during dd
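
A minimal sketch of both, reusing the path and size from the original mail
(the flags are the standard dd ones):

# bypass the client page cache entirely
dd if=/dev/zero of=/cephtest/test bs=1M count=10240 oflag=direct
# or keep the cache, but only report speed after a final flush to the cluster
dd if=/dev/zero of=/cephtest/test bs=1M count=10240 conv=fdatasync
# watch how much of the write is merely landing in RAM
watch -n1 free -m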


- Karan -

On Fri, Jun 17, 2016 at 1:58 AM, David  wrote:

> I'm probably misunderstanding the question but if you're getting 3GB/s
> from your dd, you're already caching. Can you provide some more detail on
> what you're trying to achieve.
> On 16 Jun 2016 21:53, "Patrick McGarry"  wrote:
>
>> Moving this over to ceph-user where it’ll get the eyeballs you need.
>>
>> On Mon, Jun 13, 2016 at 2:58 AM, Marcus Strasser
>>  wrote:
>> > Hello!
>> >
>> >
>> >
>> > I have a little test cluster with 2 server. Each Server have an osd
>> with 800
>> > GB, there is a 10 Gbps Link between the servers.
>> >
>> > On a ceph-client i have configured a cephfs, mount kernelspace. The
>> client
>> > is also connected with a 10 Gbps Link.
>> >
>> > All 3 use debian
>> >
>> > 4.5.5 kernel
>> >
>> > 64 GB mem
>> >
>> > There is no special configuration.
>> >
>> >
>> >
>> > Now the question:
>> >
>> > When i use the dd (~11GB) command in the cephfs mount, i get a result
>> of 3
>> > GB/s
>> >
>> >
>> >
>> > dd if=/dev/zero of=/cephtest/test bs=1M count=10240
>> >
>> >
>> >
>> > Is it possble to transfer the data faster (use full capacity oft he
>> network)
>> > and cache it with the memory?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Marcus Strasser
>> >
>> >
>> >
>> >
>> >
>> > Marcus Strasser
>> >
>> > Linux Systeme
>> >
>> > Russmedia IT GmbH
>> >
>> > A-6850 Schwarzach, Gutenbergstr. 1
>> >
>> >
>> >
>> > T +43 5572 501-872
>> >
>> > F +43 5572 501-97872
>> >
>> > marcus.stras...@highspeed.vol.at
>> >
>> > highspeed.vol.at
>> >
>> >
>> >
>> >
>>
>>
>>
>> --
>>
>> Best Regards,
>>
>> Patrick McGarry
>> Director Ceph Community || Red Hat
>> http://ceph.com  ||  http://community.redhat.com
>> @scuttlemonkey || @ceph
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] ceph benchmark

2016-06-16 Thread David
I'm probably misunderstanding the question but if you're getting 3GB/s from
your dd, you're already caching. Can you provide some more detail on what
you're trying to achieve.
On 16 Jun 2016 21:53, "Patrick McGarry"  wrote:

> Moving this over to ceph-user where it’ll get the eyeballs you need.
>
> On Mon, Jun 13, 2016 at 2:58 AM, Marcus Strasser
>  wrote:
> > Hello!
> >
> >
> >
> > I have a little test cluster with 2 server. Each Server have an osd with
> 800
> > GB, there is a 10 Gbps Link between the servers.
> >
> > On a ceph-client i have configured a cephfs, mount kernelspace. The
> client
> > is also connected with a 10 Gbps Link.
> >
> > All 3 use debian
> >
> > 4.5.5 kernel
> >
> > 64 GB mem
> >
> > There is no special configuration.
> >
> >
> >
> > Now the question:
> >
> > When i use the dd (~11GB) command in the cephfs mount, i get a result of
> 3
> > GB/s
> >
> >
> >
> > dd if=/dev/zero of=/cephtest/test bs=1M count=10240
> >
> >
> >
> > Is it possble to transfer the data faster (use full capacity oft he
> network)
> > and cache it with the memory?
> >
> >
> >
> > Thanks,
> >
> > Marcus Strasser
> >
> >
> >
> >
> >
> > Marcus Strasser
> >
> > Linux Systeme
> >
> > Russmedia IT GmbH
> >
> > A-6850 Schwarzach, Gutenbergstr. 1
> >
> >
> >
> > T +43 5572 501-872
> >
> > F +43 5572 501-97872
> >
> > marcus.stras...@highspeed.vol.at
> >
> > highspeed.vol.at
> >
> >
> >
> >
>
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] pg has invalid (post-split) stats; must scrub before tier agent can activate

2016-06-16 Thread Stillwell, Bryan J
I wanted to report back what the solution was to this problem.  It appears
like I was running into this bug:

http://tracker.ceph.com/issues/16113


After running 'ceph osd unset sortbitwise' all the unfound objects were
found!  Which makes me happy again.  :)
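
For anyone else chasing this, the commands involved were roughly the
following (the pg id is just an example taken from the log below):

# list PGs that report unfound objects
ceph health detail | grep unfound
# inspect the missing objects of one of them
ceph pg 4.3d list_missing
# the workaround/fix from http://tracker.ceph.com/issues/16113
ceph osd unset sortbitwise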

Bryan

On 5/24/16, 4:27 PM, "Stillwell, Bryan J" 
wrote:

>On one of my test clusters that I've upgraded from Infernalis to Jewel
>(10.2.1), I'm having a problem where reads are resulting in unfound
>objects.
>
>I'm using cephfs on top of an erasure-coded pool with cache tiering, which I
>believe is related.
>
>From what I can piece together, here is what the sequence of events looks
>like:
>
>Try doing an md5sum on all files in a directory:
>
>$ date
>Tue May 24 16:06:01 MDT 2016
>$ md5sum *
>
>
>Shortly afterward I see this in the logs:
>
>2016-05-24 16:06:20.406701 mon.0 172.24.88.20:6789/0 222796 : cluster
>[INF] osd.24 172.24.88.54:6814/26253 failed (2 reporters from different
>host after 21.000162 >= grace 20.00)
>2016-05-24 16:06:22.626169 mon.0 172.24.88.20:6789/0 222805 : cluster
>[INF] osd.24 172.24.88.54:6813/21502 boot
>
>2016-05-24 16:06:22.760512 mon.0 172.24.88.20:6789/0 222809 : cluster
>[INF] osd.21 172.24.88.56:6828/26011 failed (2 reporters from different
>host after 21.000314 >= grace 20.00)
>2016-05-24 16:06:24.980100 osd.23 172.24.88.54:6803/15322 120 : cluster
>[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:24.935090 osd.16 172.24.88.56:6824/25830 8 : cluster
>[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.023810 osd.16 172.24.88.56:6824/25830 9 : cluster
>[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.063645 osd.16 172.24.88.56:6824/25830 10 : cluster
>[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.326786 osd.16 172.24.88.56:6824/25830 11 : cluster
>[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.887230 osd.26 172.24.88.56:6808/10047 56 : cluster
>[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:31.413173 osd.12 172.24.88.56:6820/3496 509 : cluster
>[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:24.758508 osd.25 172.24.88.54:6801/25977 34 : cluster
>[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.307177 osd.24 172.24.88.54:6813/21502 1 : cluster
>[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.061032 osd.18 172.24.88.20:6806/23166 65 : cluster
>[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.216812 osd.22 172.24.88.20:6816/32656 24 : cluster
>[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:15.393004 mon.0 172.24.88.20:6789/0 222885 : cluster
>[INF] osd.21 172.24.88.56:6800/27171 boot
>2016-05-24 16:07:30.986037 osd.12 172.24.88.56:6820/3496 510 : cluster
>[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.606189 osd.24 172.24.88.54:6813/21502 2 : cluster
>[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.011658 osd.22 172.24.88.20:6816/32656 27 : cluster
>[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.744757 osd.18 172.24.88.20:6806/23166 66 : cluster
>[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.160872 osd.23 172.24.88.54:6803/15322 121 : cluster
>[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.945012 osd.21 172.24.88.56:6800/27171 2 : cluster
>[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.974664 osd.21 172.24.88.56:6800/27171 3 : cluster
>[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.978548 osd.21 172.24.88.56:6800/27171 4 : cluster
>[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:32.394111 osd.21 172.24.88.56:6800/27171 5 : cluster
>[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:29.828650 osd.16 172.24.88.56:6824/25830 12 : cluster
>[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.024493 osd.16 172.24.88.56:6824/25830 13 : cluster
>[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
>can activate
>
>
>
>
>Then I see the following 

Re: [ceph-users] [Ceph-community] Regarding Technical Possibility of Configuring Single Ceph Cluster on Different Networks

2016-06-16 Thread Brad Hubbard
On Fri, Jun 10, 2016 at 3:01 AM, Venkata Manojawa Paritala
 wrote:
> Hello Friends,
>
> I am Manoj Paritala, working in Vedams Software Solutions India Pvt Ltd,
> Hyderabad, India. We are developing a POC with the below specification. I
> would like to know if it is technically possible to configure a Single Ceph
> cluster with this requirement. Please find attached the network diagram for
> more clarity on what we are trying to setup.
>
> 1. There should be 3 OSD nodes (machines), 3 Monitor nodes (machines) and 3
> Client nodes in the Ceph cluster.
>
> 2. There are 3 data centers with 3 different networks. Lets call each Data
> center a Site. So, we have Site1, Site2 and Site3 with different networks.
>
> 3. Each Site should have One OSD node + Monitor node + Client node.
>
> 4. In each Site there should be again 2 sub-networks.
>
> 4a. Site Public Network :- Where in the Ceph Clients, OSDs and Monitor would
> connect.
> 4b. Site Cluster Network :- Where in only OSDs communicate for replication
> and rebalancing.
>
> 5. Configure routing between Cluster networks across sites, in such a way
> that OSD in one site can communicate to the OSDs on other sites.
>
> 6. Configure routing between Site Public Networks across, in such a way that
> ONLY the Monitor & OSD nodes in each site can communicate to the nodes in
> other sites. PLEASE NOTE, CLIENTS IN ONE SITE WILL NOT BE ABLE TO
> COMMUNICATE TO OSDs/CLIENTS ON OTHER SITES.

This won't work. The clients need to communicate with the primary OSD for each
PG, not just any OSD, so they will need access to all OSDs.
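
You can see this for yourself: for any given object the client talks to the
primary of that PG's acting set, which can be on any host (pool/object names
below are made up, and the output is only illustrative):

# show the PG and the acting set (primary listed as pN) for an object
ceph osd map rbd some-object
# osdmap e1234 pool 'rbd' (0) object 'some-object' -> pg 0.81f4e6da (0.da)
#   -> up ([7,3,11], p7) acting ([7,3,11], p7)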

A configuration like this is a stretched cluster, and the links between the
DCs will kill performance once you load them up or once recovery is
occurring. Do the links between your DCs meet the stated requirements here?

http://docs.ceph.com/docs/master/start/hardware-recommendations/#networks

>
> Hoping that my requirement is clear. Please let me know if I am not clear on
> any step.
>
> Actually, based on our reading, our understanding is that 2-way replication
> between 2 different Ceph clusters is not possible. To overcome the same, we
> came up with the above configuration, which will allow us to create pools
> with OSDs on different sites / data centers and is useful for disaster
> recovery.

I don't think this configuration will work as you expect.

>
> In case our proposed configuration is not possible, can you please suggest
> us an alternative approach to achieve our requirement.

What is your requirement? It's not clearly stated.

Cheers,
Brad

>
> Thanks & Regards,
> Manoj
>
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>



-- 
Cheers,
Brad


Re: [ceph-users] CephFS Bug found with CentOS 7.2

2016-06-16 Thread Adam Tygart
This sounds an awful lot like a bug I've run into a few times (not
often enough to get a good backtrace out of the kernel or mds)
involving vim on a symlink to a file in another directory. It will
occasionally corrupt the symlink in such a way that the symlink is
unreadable. Filling dmesg with:

[ 2368.036667] ceph: fill_inode badness on 8800bb5fb610
[ 2368.969657] [ cut here ]
[ 2368.969670] WARNING: CPU: 0 PID: 15 at fs/ceph/inode.c:813
fill_inode.isra.19+0x4b1/0xa49()
[ 2368.969672] Modules linked in:
[ 2368.969684] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: GW
   4.5.0-gentoo #1
[ 2368.969686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 2368.969693] Workqueue: ceph-msgr ceph_con_workfn
[ 2368.969695]  0286 7000a7b9 88017e267af0
b142ec39
[ 2368.969698]   0009 88017e267b28
b1091c83
[ 2368.969700]  b13be512 c900020da8cd 880427a30230

[ 2368.969704] Call Trace:
[ 2368.969709]  [] dump_stack+0x63/0x7f
[ 2368.969714]  [] warn_slowpath_common+0x9a/0xb3
[ 2368.969717]  [] ? fill_inode.isra.19+0x4b1/0xa49
[ 2368.969719]  [] warn_slowpath_null+0x15/0x17
[ 2368.969722]  [] fill_inode.isra.19+0x4b1/0xa49
[ 2368.969724]  [] ? ceph_mount+0x729/0x72e
[ 2368.969727]  [] ceph_readdir_prepopulate+0x48f/0x70c
[ 2368.969730]  [] dispatch+0xebf/0x1428
[ 2368.969752]  [] ? ceph_x_check_message_signature+0x42/0xc4
[ 2368.969756]  [] ceph_con_workfn+0xe1a/0x24f3
[ 2368.969759]  [] ? load_TLS+0xb/0xf
[ 2368.969761]  [] ? __switch_to+0x3b0/0x42b
[ 2368.969765]  [] ? finish_task_switch+0xff/0x191
[ 2368.969768]  [] process_one_work+0x175/0x2a0
[ 2368.969770]  [] worker_thread+0x1fc/0x2ae
[ 2368.969772]  [] ? rescuer_thread+0x2c0/0x2c0
[ 2368.969775]  [] kthread+0xaf/0xb7
[ 2368.969777]  [] ? kthread_parkme+0x1f/0x1f
[ 2368.969780]  [] ret_from_fork+0x3f/0x70
[ 2368.969782]  [] ? kthread_parkme+0x1f/0x1f
[ 2368.969784] ---[ end trace b054c5c6854fd2ab ]---
[ 2368.969786] ceph: fill_inode badness on 880428185d70
[ 2370.289733] [ cut here ]
[ 2370.289747] WARNING: CPU: 0 PID: 15 at fs/ceph/inode.c:813
fill_inode.isra.19+0x4b1/0xa49()
[ 2370.289750] Modules linked in:
[ 2370.289756] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: GW
   4.5.0-gentoo #1
[ 2370.289759] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 2370.289767] Workqueue: ceph-msgr ceph_con_workfn
[ 2370.289769]  0286 7000a7b9 88017e267af0
b142ec39
[ 2370.289774]   0009 88017e267b28
b1091c83
[ 2370.289777]  b13be512 c900020f58cd 880427a30230

[ 2370.289781] Call Trace:
[ 2370.289787]  [] dump_stack+0x63/0x7f
[ 2370.289793]  [] warn_slowpath_common+0x9a/0xb3
[ 2370.289797]  [] ? fill_inode.isra.19+0x4b1/0xa49
[ 2370.289801]  [] warn_slowpath_null+0x15/0x17
[ 2370.289804]  [] fill_inode.isra.19+0x4b1/0xa49
[ 2370.289807]  [] ? ceph_mount+0x729/0x72e
[ 2370.289811]  [] ceph_readdir_prepopulate+0x48f/0x70c
[ 2370.289815]  [] dispatch+0xebf/0x1428
[ 2370.289821]  [] ? ceph_x_check_message_signature+0x42/0xc4
[ 2370.289824]  [] ceph_con_workfn+0xe1a/0x24f3
[ 2370.289829]  [] ? load_TLS+0xb/0xf
[ 2370.289832]  [] ? __switch_to+0x3b0/0x42b
[ 2370.289837]  [] ? finish_task_switch+0xff/0x191
[ 2370.289841]  [] process_one_work+0x175/0x2a0
[ 2370.289843]  [] worker_thread+0x1fc/0x2ae
[ 2370.289846]  [] ? rescuer_thread+0x2c0/0x2c0
[ 2370.289849]  [] kthread+0xaf/0xb7
[ 2370.289853]  [] ? kthread_parkme+0x1f/0x1f
[ 2370.289857]  [] ret_from_fork+0x3f/0x70
[ 2370.289860]  [] ? kthread_parkme+0x1f/0x1f
[ 2370.289863] ---[ end trace b054c5c6854fd2ac ]---
[ 2370.289865] ceph: fill_inode badness on 880428185d70
[ 2371.525649] [ cut here ]
[ 2371.525663] WARNING: CPU: 0 PID: 15 at fs/ceph/inode.c:813
fill_inode.isra.19+0x4b1/0xa49()
[ 2371.525665] Modules linked in:
[ 2371.525670] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: GW
   4.5.0-gentoo #1
[ 2371.525672] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 2371.525679] Workqueue: ceph-msgr ceph_con_workfn
[ 2371.525682]  0286 7000a7b9 88017e267af0
b142ec39
[ 2371.525685]   0009 88017e267b28
b1091c83
[ 2371.525687]  b13be512 c900021108cd 880427a30230

[ 2371.525690] Call Trace:
[ 2371.525696]  [] dump_stack+0x63/0x7f
[ 2371.525701]  [] warn_slowpath_common+0x9a/0xb3
[ 2371.525704]  [] ? fill_inode.isra.19+0x4b1/0xa49
[ 2371.525707]  [] warn_slowpath_null+0x15/0x17
[ 2371.525740]  [] fill_inode.isra.19+0x4b1/0xa49
[ 2371.525744]  [] ? ceph_mount+0x729/0x72e
[ 2371.525747]  [] ceph_readdir_prepopulate+0x48f/0x70c
[ 2371.525751]  [] dis

Re: [ceph-users] CephFS Bug found with CentOS 7.2

2016-06-16 Thread Jason Gress
This is the latest default kernel with CentOS7.  We also tried a newer
kernel (from elrepo), a 4.4 that has the same problem, so I don't think
that is it.  Thank you for the suggestion though.

We upgraded our cluster to the 10.2.2 release today, and it didn't resolve
all of the issues.  It's possible that a related issue is actually
permissions.  Something may not be right with our config (or a bug) here.

While testing we noticed that there may actually be two issues here.  I am
unsure, as we noticed that the most consistent way to reproduce our issue
is to use vim or sed -i, which do in-place renames:

[root@ftp01 cron]# ls -la
total 3
drwx------   1 root root 2044 Jun 16 15:50 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root  300 Jun 16 15:50 file
-rw-------   1 root root 2044 Jun 16 13:47 root
[root@ftp01 cron]# sed -i 's/^/#/' file
sed: cannot rename ./sedfB2CkO: Permission denied


Strangely, adding or deleting files works fine, it's only renaming that
fails.  And strangely I was able to successfully edit the file on ftp02:

[root@ftp02 cron]# sed -i 's/^/#/' file
[root@ftp02 cron]# ls -la
total 3
drwx------   1 root root 2044 Jun 16 15:49 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root  313 Jun 16 15:49 file
-rw-------   1 root root 2044 Jun 16 13:47 root


Then it worked on ftp01 this time:
[root@ftp01 cron]# ls -la
total 3
drwx------   1 root root 2357 Jun 16 15:49 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root  313 Jun 16 15:49 file
-rw-------   1 root root 2044 Jun 16 13:47 root


Then, I vim'd it successfully on ftp01... Then ran the sed again:

[root@ftp01 cron]# sed -i 's/^/#/' file
sed: cannot rename ./sedfB2CkO: Permission denied
[root@ftp01 cron]# ls -la
total 3
drwx------   1 root root 2044 Jun 16 15:51 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root  300 Jun 16 15:50 file
-rw-------   1 root root 2044 Jun 16 13:47 root


And now we have the zero file problem again:

[root@ftp02 cron]# ls -la
total 2
drwx------   1 root root 2044 Jun 16 15:51 .
drwxr-xr-x. 10 root root  104 May 19 09:34 ..
-rw-r--r--   1 root root    0 Jun 16 15:50 file
-rw-------   1 root root 2044 Jun 16 13:47 root


Anyway, I wonder how much of this issue is related to that cannot rename
issue above.  Here are our security settings:

client.ftp01
key: 
caps: [mds] allow r, allow rw path=/ftp
caps: [mon] allow r
caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
client.ftp02
key: 
caps: [mds] allow r, allow rw path=/ftp
caps: [mon] allow r
caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data


/ftp is the directory on cephfs under which cron lives; the full path is
/ftp/cron .

I hope this helps and thank you for your time!

Jason

On 6/15/16, 4:43 PM, "John Spray"  wrote:

>On Wed, Jun 15, 2016 at 10:21 PM, Jason Gress 
>wrote:
>> While trying to use CephFS as a clustered filesystem, we stumbled upon a
>> reproducible bug that is unfortunately pretty serious, as it leads to
>>data
>> loss.  Here is the situation:
>>
>> We have two systems, named ftp01 and ftp02.  They are both running
>>CentOS
>> 7.2, with this kernel release and ceph packages:
>>
>> kernel-3.10.0-327.18.2.el7.x86_64
>
>That is an old-ish kernel to be using with cephfs.  It may well be the
>source of your issues.
>
>> [root@ftp01 cron]# rpm -qa | grep ceph
>> ceph-base-10.2.1-0.el7.x86_64
>> ceph-deploy-1.5.33-0.noarch
>> ceph-mon-10.2.1-0.el7.x86_64
>> libcephfs1-10.2.1-0.el7.x86_64
>> ceph-selinux-10.2.1-0.el7.x86_64
>> ceph-mds-10.2.1-0.el7.x86_64
>> ceph-common-10.2.1-0.el7.x86_64
>> ceph-10.2.1-0.el7.x86_64
>> python-cephfs-10.2.1-0.el7.x86_64
>> ceph-osd-10.2.1-0.el7.x86_64
>>
>> Mounted like so:
>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph
>> _netdev,relatime,name=ftp01,secretfile=/etc/ceph/ftp01.secret 0 0
>> And:
>> XX.XX.XX.XX:/ftp/cron /var/spool/cron ceph
>> _netdev,relatime,name=ftp02,secretfile=/etc/ceph/ftp02.secret 0 0
>>
>> This filesystem has 234GB worth of data on it, and I created another
>> subdirectory and mounted it, NFS style.
>>
>> Here were the steps to reproduce:
>>
>> First, I created a file (I was mounting /var/spool/cron on two systems)
>>on
>> ftp01:
>> (crond is not running right now on either system to keep the variables
>>down)
>>
>> [root@ftp01 cron]# cp /tmp/root .
>>
>> Shows up on both fine:
>> [root@ftp01 cron]# ls -la
>> total 2
>> drwx--   1 root root0 Jun 15 15:50 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw---   1 root root 2043 Jun 15 15:50 root
>> [root@ftp01 cron]# md5sum root
>> 0636c8deaeadfea7b9ddaa29652b43ae  root
>>
>> [root@ftp02 cron]# ls -la
>> total 2
>> drwx--   1 root root 2043 Jun 15 15:50 .
>> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
>> -rw---   1 root root 2043 Jun 15 15:50 root
>> [root@ftp02 cron]# md5sum root
>> 0636c8deaeadfea7b9ddaa29652b43ae  root
>>
>>

Re: [ceph-users] rbd ioengine for fio

2016-06-16 Thread Somnath Roy
What is your fio script ?

Make sure you do this..

1. Run, say, ‘ceph -s’ from the server you are trying to connect from and see if it is 
connecting properly or not. If so, you don’t have any keyring issues.

2. Now, make sure you have given the following param values properly based on 
your setup.

pool=
rbdname=
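
A quick way to sanity-check both values from the client (the names below are
the ones from Mavis' job file; adjust to your own setup):

# confirm the pool exists and is visible with this keyring
rados lspools | grep ecssdcache
# confirm the image exists in that pool
rbd ls ecssdcache
rbd info ecssdcache/imagecacherbd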

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mavis 
Xiang
Sent: Thursday, June 16, 2016 1:47 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] rbd ioengine for fio

Hi all,
I am new to the rbd engine for fio, and ran into the following problems when i 
try to run a 4k write with my rbd image:



rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, 
iodepth=32

fio-2.11-17-ga275

Starting 1 process

rbd engine: RBD version: 0.1.8

rados_connect failed.

fio_rbd_connect failed.



It seems that the rbd client cannot connect to the ceph cluster.

Ceph health output:

cluster e414604c-29d7-4adb-a889-7f70fc252dfa

 health HEALTH_WARN clock skew detected on mon.h02, mon.h05



But it should not affected the connection to the cluster.

Ceph.conf:

[global]

fsid = e414604c-29d7-4adb-a889-7f70fc252dfa

mon_initial_members = h02

mon_host = XXX.X.XXX.XXX

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

filestore_xattr_use_omap = true

osd_pool_default_pg_num = 2400

osd_pool_default_pgp_num = 2400

public_network = XXX.X.XXX.X/21



[osd]

osd_crush_update_on_start = false






Should this be something about keyring? i did not find any options about 
keyring that can be set in fio file.
Can anyone please give some insights about this problem?
Any help would be appreciated!

Thanks!

Yu




Re: [ceph-users] ceph benchmark

2016-06-16 Thread Patrick McGarry
Moving this over to ceph-user where it’ll get the eyeballs you need.

On Mon, Jun 13, 2016 at 2:58 AM, Marcus Strasser
 wrote:
> Hello!
>
>
>
> I have a little test cluster with 2 server. Each Server have an osd with 800
> GB, there is a 10 Gbps Link between the servers.
>
> On a ceph-client i have configured a cephfs, mount kernelspace. The client
> is also connected with a 10 Gbps Link.
>
> All 3 use debian
>
> 4.5.5 kernel
>
> 64 GB mem
>
> There is no special configuration.
>
>
>
> Now the question:
>
> When i use the dd (~11GB) command in the cephfs mount, i get a result of 3
> GB/s
>
>
>
> dd if=/dev/zero of=/cephtest/test bs=1M count=10240
>
>
>
> Is it possble to transfer the data faster (use full capacity oft he network)
> and cache it with the memory?
>
>
>
> Thanks,
>
> Marcus Strasser
>
>
>
>
>
> Marcus Strasser
>
> Linux Systeme
>
> Russmedia IT GmbH
>
> A-6850 Schwarzach, Gutenbergstr. 1
>
>
>
> T +43 5572 501-872
>
> F +43 5572 501-97872
>
> marcus.stras...@highspeed.vol.at
>
> highspeed.vol.at
>
>
>
>



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd ioengine for fio

2016-06-16 Thread Mavis Xiang
Hi all,
I am new to the rbd engine for fio, and ran into the following problem
when I tried to run a 4k write against my rbd image:


rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
iodepth=32

fio-2.11-17-ga275

Starting 1 process

rbd engine: RBD version: 0.1.8

rados_connect failed.

fio_rbd_connect failed.


It seems that the rbd client cannot connect to the ceph cluster.

Ceph health output:

cluster e414604c-29d7-4adb-a889-7f70fc252dfa

 health HEALTH_WARN clock skew detected on mon.h02, mon.h05



But that should not affect the connection to the cluster.

Ceph.conf:

[global]

fsid = e414604c-29d7-4adb-a889-7f70fc252dfa

mon_initial_members = h02

mon_host = XXX.X.XXX.XXX

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

filestore_xattr_use_omap = true

osd_pool_default_pg_num = 2400

osd_pool_default_pgp_num = 2400

public_network = XXX.X.XXX.X/21


[osd]

osd_crush_update_on_start = false




Could this be a keyring issue? I did not find any keyring-related options
that can be set in the fio job file.
Can anyone please give some insights about this problem?
Any help would be appreciated!

Thanks!

Yu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switches and latency

2016-06-16 Thread Gandalf Corvotempesta
2016-06-16 12:54 GMT+02:00 Oliver Dzombic :
> aside from the question of the coolness factor of Infinitiband,
> you should always also consider the question of replacing parts and
> extending cluster.
>
> A 10G Network environment is up to date currently, and will be for some
> more years. You can easily get equipment for it, and the pricing gets
> lower and lower. Also you can use that network environment also for
> other stuff ( if needed ) just to keep flexibility.
>
> With the IB stuff, you can only use it for one purpose. And you have a (
> very ) limited choice of options to get new parts.

I totally agree with you on this.
This is exactly my biggest concern: I don't know IB, IB hardware is hard to source
from my distributor, and currently it is not usable for anything other than storage.

A 10GbE network (even better, 10GBaseT) could be used for everything, even
for our local offices in case we need to change some parts. I could retire
one 10GbE switch from the datacenter, replace it with a newer one and reuse
the old one in our offices.

The same would not be possible with IB, as I'd have to redesign the
whole office network or something similar.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)

2016-06-16 Thread stephane.davy
Hi,

Same issue with Centos 7, I also put back this file in /etc/udev/rules.d. 

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Alexandre DERUMIER
Sent: Thursday, June 16, 2016 17:53
To: Karsten Heymann; Loris Cuoghi
Cc: Loic Dachary; ceph-users
Subject: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, 
jessie)

Hi,

I have the same problem with osd disks not mounted at boot on jessie with ceph 
jewel

workaround is to re-add 60-ceph-partuuid-workaround.rules file to udev

http://tracker.ceph.com/issues/16351


- Mail original -
De: "aderumier" 
À: "Karsten Heymann" , "Loris Cuoghi" 

Cc: "Loic Dachary" , "ceph-users" 
Envoyé: Jeudi 28 Avril 2016 07:42:04
Objet: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)

Hi, 
they are missing target files in debian packages 

http://tracker.ceph.com/issues/15573 
https://github.com/ceph/ceph/pull/8700 

I have also done some other trackers about packaging bug 

jewel: debian package: wrong /etc/default/ceph/ceph location 
http://tracker.ceph.com/issues/15587 

debian/ubuntu : TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES not specified in 
/etc/default/cep 
http://tracker.ceph.com/issues/15588 

jewel: debian package: init.d script bug 
http://tracker.ceph.com/issues/15585 


@CC loic dachary, maybe he could help to speed up packaging fixes 

- Mail original - 
De: "Karsten Heymann"  
À: "Loris Cuoghi"  
Cc: "ceph-users"  
Envoyé: Mercredi 27 Avril 2016 15:20:29 
Objet: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie) 

2016-04-27 15:18 GMT+02:00 Loris Cuoghi : 
> Le 27/04/2016 14:45, Karsten Heymann a écrit : 
>> one workaround I found was to add 
>> 
>> [Install] 
>> WantedBy=ceph-osd.target 
>> 
>> to /lib/systemd/system/ceph-disk@.service and then manually enable my 
>> disks with 
>> 
>> # systemctl enable ceph-disk\@dev-sdi1 
>> # systemctl start ceph-disk\@dev-sdi1 
>> 
>> That way they at least are started at boot time. 

> Great! But only if the disks keep their device names, right ? 

Exactly. It's just a little workaround until the real issue is fixed. 

+Karsten 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou 
falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.
Thank you.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)

2016-06-16 Thread Alexandre DERUMIER
Hi,

I have the same problem with osd disks not mounted at boot on jessie with ceph 
jewel

workaround is to re-add 60-ceph-partuuid-workaround.rules file to udev

http://tracker.ceph.com/issues/16351
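
Roughly like this (a sketch, assuming you grab the rules file from an older
package or from the ceph git history):

cp 60-ceph-partuuid-workaround.rules /etc/udev/rules.d/
udevadm control --reload-rules
udevadm trigger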


- Mail original -
De: "aderumier" 
À: "Karsten Heymann" , "Loris Cuoghi" 

Cc: "Loic Dachary" , "ceph-users" 
Envoyé: Jeudi 28 Avril 2016 07:42:04
Objet: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)

Hi, 
they are missing target files in debian packages 

http://tracker.ceph.com/issues/15573 
https://github.com/ceph/ceph/pull/8700 

I have also done some other trackers about packaging bug 

jewel: debian package: wrong /etc/default/ceph/ceph location 
http://tracker.ceph.com/issues/15587 

debian/ubuntu : TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES not specified in 
/etc/default/cep 
http://tracker.ceph.com/issues/15588 

jewel: debian package: init.d script bug 
http://tracker.ceph.com/issues/15585 


@CC loic dachary, maybe he could help to speed up packaging fixes 

- Mail original - 
De: "Karsten Heymann"  
À: "Loris Cuoghi"  
Cc: "ceph-users"  
Envoyé: Mercredi 27 Avril 2016 15:20:29 
Objet: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie) 

2016-04-27 15:18 GMT+02:00 Loris Cuoghi : 
> Le 27/04/2016 14:45, Karsten Heymann a écrit : 
>> one workaround I found was to add 
>> 
>> [Install] 
>> WantedBy=ceph-osd.target 
>> 
>> to /lib/systemd/system/ceph-disk@.service and then manually enable my 
>> disks with 
>> 
>> # systemctl enable ceph-disk\@dev-sdi1 
>> # systemctl start ceph-disk\@dev-sdi1 
>> 
>> That way they at least are started at boot time. 

> Great! But only if the disks keep their device names, right ? 

Exactly. It's just a little workaround until the real issue is fixed. 

+Karsten 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CEPH with NVMe SSDs and Caching vs Journaling on SSDs

2016-06-16 Thread Tim Gipson
A few questions.

First, is there a good step-by-step guide to setting up a cache tier with NVMe SSDs 
that are on separate hosts?  Is that even possible?
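
For reference, the kind of step-by-step I mean is roughly the standard tier
plumbing below (just a sketch with hypothetical pool names "rbd" and "nvme-cache";
whether the cache pool can sensibly live on separate NVMe-only hosts is exactly
what I am unsure about):

ceph osd tier add rbd nvme-cache
ceph osd tier cache-mode nvme-cache writeback
ceph osd tier set-overlay rbd nvme-cache
ceph osd pool set nvme-cache hit_set_type bloom
ceph osd pool set nvme-cache target_max_bytes 500000000000   # ~500 GB, adjust to the NVMe capacity
# (the nvme-cache pool would also need a CRUSH rule restricting it to the NVMe hosts)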

Second, what sort of performance are people seeing from caching 
tiers/journaling on SSDs in Jewel?

Right now I am working on trying to find best practice for a Ceph cluster with 
3 monitor nodes and 3 OSD nodes, each with one 800GB NVMe drive and 12 6TB drives.

My goal is reliable/somewhat fast performance.

Any help would be greatly appreciated!

Tim Gipson
Systems Engineer



618 Grassmere Park Drive, Suite 12
Nashville, TN 37211







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Down a osd and bring it Up

2016-06-16 Thread Oliver Dzombic
Hi,

the Ceph documentation does not currently keep up with the development
of the software.

I advise you to always check the init/runlevel files responsible for your OS.

In case of Redhat 7, its systemd.

So in that case a

systemctl -a | grep ceph

will show you all available ceph units/services.


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 16.06.2016 um 15:44 schrieb Kanchana. P:
> Hi,
> 
> How can I down a osd and bring it back in RHEL 7.2 with ceph verison 10.2.2
> 
> sudo start ceph-osd id=1 fails with “sudo: start: command not found”.
> 
> I have 5 osds in each node and i want to down one particular osd (sudo
> stop ceph-sd id=1 also fails) and see whether replicas are written to
> other osds without any issues.
> 
> Thanks in advance.
> 
> –kanchana.
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Down a osd and bring it Up

2016-06-16 Thread Joshua M. Boniface
RHEL 7.2 and Jewel should be using the systemd unit files by default, so you'd 
do something like:

> sudo systemctl stop ceph-osd@

and then

> sudo systemctl start ceph-osd@

when you're done.
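
For example, assuming osd.1 lives on that host (the part after the "@" is the OSD id):

> sudo systemctl stop ceph-osd@1
> ceph osd tree     (osd.1 should now show as down)
> sudo systemctl start ceph-osd@1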

--
Joshua M. Boniface
Linux System Ærchitect
Sigmentation fault. Core dumped.

On 16/06/16 09:44 AM, Kanchana. P wrote:
>
> Hi,
>
> How can I down a osd and bring it back in RHEL 7.2 with ceph verison 10.2.2
>
> sudo start ceph-osd id=1 fails with “sudo: start: command not found”.
>
> I have 5 osds in each node and i want to down one particular osd (sudo stop 
> ceph-sd id=1 also fails) and see whether replicas are written to other osds 
> without any issues.
>
> Thanks in advance.
>
> –kanchana.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-16 Thread Wade Holler
Blairo,

That's right, I do see "lots" of read IO!  If I compare the "bad
(330Mil)" pool with the new test (good) pool:

iostat while running to the "good" pool shows almost all writes.
iostat while running to the "bad" pool has VERY large read spikes,
with almost no writes.

Sounds like you have an idea about what causes this.  I'm happy to hear it!

slabinfo is below.  Dropping caches has no effect.

slabinfo - version: 2.1

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>

blk_io_mits 4674   4769   1664   198 : tunables00
  0 : slabdata251251  0

rpc_inode_cache0  0640   518 : tunables00
  0 : slabdata  0  0  0

t10_alua_tg_pt_gp_cache  0  0408   404 : tunables0
   00 : slabdata  0  0  0

t10_pr_reg_cache   0  0696   478 : tunables00
  0 : slabdata  0  0  0

se_sess_cache  0  0896   368 : tunables00
  0 : slabdata  0  0  0

kvm_vcpu   0  0  1625628 : tunables00
  0 : slabdata  0  0  0

kvm_mmu_page_header 48 48168   482 : tunables0
00 : slabdata  1  1  0

xfs_dqtrx  0  0528   628 : tunables00
  0 : slabdata  0  0  0

xfs_dquot  0  0472   698 : tunables00
  0 : slabdata  0  0  0

xfs_icr0  0144   562 : tunables00
  0 : slabdata  0  0  0

xfs_ili   96974261 97026835152   532 : tunables0
 00 : slabdata 1830695 1830695  0

xfs_inode 97120263 97120263   1088   308 : tunables0
 00 : slabdata 3237631 3237631  0

xfs_efd_item6280   6360400   404 : tunables00
  0 : slabdata159159  0

xfs_da_state3264   3264480   688 : tunables00
  0 : slabdata 48 48  0

xfs_btree_cur   1872   1872208   392 : tunables00
  0 : slabdata 48 48  0

xfs_log_ticket 23980  23980184   442 : tunables00
  0 : slabdata545545  0

scsi_cmd_cache  4536   4644448   364 : tunables00
  0 : slabdata129129  0

kcopyd_job 0  0   331298 : tunables00
  0 : slabdata  0  0  0

dm_uevent  0  0   2608   128 : tunables00
  0 : slabdata  0  0  0

dm_rq_target_io0  0136   602 : tunables00
  0 : slabdata  0  0  0

UDPLITEv6  0  0   1152   288 : tunables00
  0 : slabdata  0  0  0

UDPv6980980   1152   288 : tunables00
  0 : slabdata 35 35  0

tw_sock_TCPv6  0  0256   644 : tunables00
  0 : slabdata  0  0  0

TCPv6510510   2112   158 : tunables00
  0 : slabdata 34 34  0

uhci_urb_priv   6132   6132 56   731 : tunables00
  0 : slabdata 84 84  0

cfq_queue  64153  97300232   704 : tunables00
  0 : slabdata   1390   1390  0

bsg_cmd0  0312   524 : tunables00
  0 : slabdata  0  0  0

mqueue_inode_cache 36 36896   368 : tunables00
   0 : slabdata  1  1  0

hugetlbfs_inode_cache106106608   538 : tunables0
 00 : slabdata  2  2  0

configfs_dir_cache 46 46 88   461 : tunables00
   0 : slabdata  1  1  0

dquot  0  0256   644 : tunables00
  0 : slabdata  0  0  0

kioctx  1512   1512576   568 : tunables00
  0 : slabdata 27 27  0

userfaultfd_ctx_cache  0  0128   642 : tunables0
 00 : slabdata  0  0  0

pid_namespace  0  0   2176   158 : tunables00
  0 : slabdata  0  0  0

user_namespace 0  0280   584 : tunables00
  0 : slabdata  0  0  0

posix_timers_cache  0  0248   664 : tunables00
   0 : slabdata  0  0  0

UDP-Lite   0  0   1024   328 : tunables00
  0 : slabdata  0  0  0

RAW 1972   1972960   348 : tunables00
  0 : slabdata 58 58  0

UDP 1472   1504   1024   328 : tunables00
  0 : slabdata 47 47  0

tw_sock_TCP 6272   6400256   644 : tunables00
  0 : slabdata100100  0

TCP 5236   5457   1920   178 : tunables00
  0 : slabdata321321  0

blkdev_queue 421465   2088   158 : tunables00
  0 : slabdata 31 31  

Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-16 Thread Wade Holler
Blairo,

That's right, I do see "lots" of read IO!  If I compare the "bad (330Mil)"
pool with the new test (good) pool:

iostat while running to the "good" pool shows almost all writes.
iostat while running to the "bad" pool has VERY large read spikes, with
almost no writes.

Sounds like you have an idea about what causes this.  I'm happy to hear it!

slabinfo is below.  Dropping caches has no effect.

slabinfo - version: 2.1

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>

blk_io_mits 4674   4769   1664   198 : tunables000
: slabdata251251  0

rpc_inode_cache0  0640   518 : tunables000
: slabdata  0  0  0

t10_alua_tg_pt_gp_cache  0  0408   404 : tunables00
  0 : slabdata  0  0  0

t10_pr_reg_cache   0  0696   478 : tunables000
: slabdata  0  0  0

se_sess_cache  0  0896   368 : tunables000
: slabdata  0  0  0

kvm_vcpu   0  0  1625628 : tunables000
: slabdata  0  0  0

kvm_mmu_page_header 48 48168   482 : tunables000
: slabdata  1  1  0

xfs_dqtrx  0  0528   628 : tunables000
: slabdata  0  0  0

xfs_dquot  0  0472   698 : tunables000
: slabdata  0  0  0

xfs_icr0  0144   562 : tunables000
: slabdata  0  0  0

xfs_ili   96974261 97026835152   532 : tunables00
  0 : slabdata 1830695 1830695  0

xfs_inode 97120263 97120263   1088   308 : tunables00
  0 : slabdata 3237631 3237631  0

xfs_efd_item6280   6360400   404 : tunables000
: slabdata159159  0

xfs_da_state3264   3264480   688 : tunables000
: slabdata 48 48  0

xfs_btree_cur   1872   1872208   392 : tunables000
: slabdata 48 48  0

xfs_log_ticket 23980  23980184   442 : tunables000
: slabdata545545  0

scsi_cmd_cache  4536   4644448   364 : tunables000
: slabdata129129  0

kcopyd_job 0  0   331298 : tunables000
: slabdata  0  0  0

dm_uevent  0  0   2608   128 : tunables000
: slabdata  0  0  0

dm_rq_target_io0  0136   602 : tunables000
: slabdata  0  0  0

UDPLITEv6  0  0   1152   288 : tunables000
: slabdata  0  0  0

UDPv6980980   1152   288 : tunables000
: slabdata 35 35  0

tw_sock_TCPv6  0  0256   644 : tunables000
: slabdata  0  0  0

TCPv6510510   2112   158 : tunables000
: slabdata 34 34  0

uhci_urb_priv   6132   6132 56   731 : tunables000
: slabdata 84 84  0

cfq_queue  64153  97300232   704 : tunables000
: slabdata   1390   1390  0

bsg_cmd0  0312   524 : tunables000
: slabdata  0  0  0

mqueue_inode_cache 36 36896   368 : tunables000
: slabdata  1  1  0

hugetlbfs_inode_cache106106608   538 : tunables00
  0 : slabdata  2  2  0

configfs_dir_cache 46 46 88   461 : tunables000
: slabdata  1  1  0

dquot  0  0256   644 : tunables000
: slabdata  0  0  0

kioctx  1512   1512576   568 : tunables000
: slabdata 27 27  0

userfaultfd_ctx_cache  0  0128   642 : tunables00
  0 : slabdata  0  0  0

pid_namespace  0  0   2176   158 : tunables000
: slabdata  0  0  0

user_namespace 0  0280   584 : tunables000
: slabdata  0  0  0

posix_timers_cache  0  0248   664 : tunables000
: slabdata  0  0  0

UDP-Lite   0  0   1024   328 : tunables000
: slabdata  0  0  0

RAW 1972   1972960   348 : tunables000
: slabdata 58 58  0

UDP 1472   1504   1024   328 : tunables000
: slabdata 47 47  0

tw_sock_TCP 6272   6400256   644 : tunables000
: slabdata100100  0

TCP 5236   5457   1920   178 : tunables000
: slabdata321321  0

blkdev_queue 421465   2088   158 : tunables0  

Re: [ceph-users] Dramatic performance drop at certain number ofobjects in pool

2016-06-16 Thread Mykola
I see the same behavior with the threshold of around 20M objects for 4 nodes, 
16 OSDs, 32TB, hdd-based cluster. The issue dates back to hammer. 

Sent from my Windows 10 phone

From: Blair Bethwaite
Sent: Thursday, June 16, 2016 2:48 PM
To: Wade Holler
Cc: Ceph Development; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Dramatic performance drop at certain number ofobjects 
in pool

Hi Wade,

What IO are you seeing on the OSD devices when this happens (see e.g.
iostat), are there short periods of high read IOPS where (almost) no
writes occur? What does your memory usage look like (including slab)?

Cheers,

On 16 June 2016 at 22:14, Wade Holler  wrote:
> Hi All,
>
> I have a repeatable condition when the object count in a pool gets to
> 320-330 million the object write time dramatically and almost
> instantly increases as much as 10X, exhibited by fs_apply_latency
> going from 10ms to 100s of ms.
>
> Can someone point me in a direction / have an explanation ?
>
> I can add a new pool and it performs normally.
>
> Config is generally
> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> with NVME for journals. Centos 7.2, XFS
>
> Jewell is the release; inserting objects with librados via some Python
> test code.
>
> Best Regards
> Wade
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Down a osd and bring it Up

2016-06-16 Thread Kanchana. P
Hi,

How can I take an OSD down and bring it back up on RHEL 7.2 with ceph version 10.2.2?

sudo start ceph-osd id=1 fails with “sudo: start: command not found”.

I have 5 OSDs in each node and I want to take down one particular OSD (sudo stop
ceph-osd id=1 also fails) and see whether replicas are written to the other OSDs
without any issues.

Thanks in advance.

–kanchana.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switches and latency

2016-06-16 Thread Christian Balzer

Hello,

On Thu, 16 Jun 2016 12:44:51 +0200 Gandalf Corvotempesta wrote:

> 2016-06-16 3:53 GMT+02:00 Christian Balzer :
> > Gandalf, first read:
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg29546.html
> >
> > And this thread by Nick:
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg29708.html
> 
> Interesting reading. Thanks.
> 
> > Overly optimistic.
> > In an idle cluster with synthetic tests you might get sequential reads
> > that are around 150MB/s per HDD.
> > As for writes, think 80MB/s, again in an idle cluster.
> >
> > Any realistic, random I/O and you're looking at 50MB/s at most either
> > way.
> >
> > So your storage nodes can't really saturate even a single 10Gb/s link
> > in real life situations.
> 
> Ok.
> 
> > Journal SSDs can improve on things, but that's mostly for IOPS.
> > In fact they easily become the bottleneck bandwidth wise and are so on
> > most of my storage nodes.
> > Because you'd need at least 2 400GB DC S3710 SSDs to get around 1GB/s
> > writes, or one link worth.
> 
> I plan to use 1 or 2 SSD journal (probably, 1 SSD every 6 spinning disks)
> 
That's as large as I would make that failure domain, also make sure to
choose SSDs that work well with Ceph, endurance and sync write speed wise
(lots of threads about this).

> > Splitting things in cluster and public networks ONLY makes sense when
> > your storage node can saturate ALL the network bandwidth, which
> > usually is only the case when it comes to very expensive SSD/NVMe only
> > nodes.
> 
> This is not my case.
> 
> > Going back to your original post, with a split network the latency in
> > both networks counts the same, as a client write will NOT be
> > acknowledged until it has reach the journal of all replicas, so having
> > a higher latency cluster network is counterproductive.
> 
> Ok.
> 
> > Or if you can start with a clean slate (including the clients), look at
> > Infiniband.
> > All my production clusters are running entirely IB (IPoIB currently)
> > and I'm very happy with the performance, latency and cost.
> 
> Yes, i'll start with a brand new network.

Alas, you're not really starting clean: as you say below, your clients already
have 10GigE ports.
So I'll be terse from here.

> Acutally i'm testing with some old IB switches (DDR) and i'm not very
> happy, as IPoIB doesn't go over 8/9Gbit/s in a DDR. 
You should have gotten some insight out of your "RDMA/Infiniband status"
thread, but I never bothered with DDR.

> Additionally, CX4
> cables used by DDR are... HUGE and very "hard" to bend in the rack.
> I don't know if QDR cables are thinner.
> 
Nope, but then again some of the 10GigE cables are rather stiff or
fragile, too.

> Are you using QDR? 
Yes, because it's cheaper, we don't need the bandwidth and the latency is
supposedly lower than FDR.

> I've seen a couple of mellanox used switches on ebay
> that seems to be ok for me. 36 QDR ports would be awesome but I don't
> have any IB knowledge.
Largest switch we use is 18 ports, our clusters are small.

> Could I keep the IB fabric unconfigured and use only IPoIB ?
Pretty much, a very basic OpenSM config and you're good to go.

> I can create a bonded (failover) IPoIB device on each node and add 2 or
> more IB cables between both switches. In a normal Ethernet network,
> these 2 cables must be joined in a LAG to avoid loops. Is infiniband
> able to manage this on their
> own ? 
Yes. The basic/default OpenSM routing engine needs to be told to use more than one
path where possible ("lmc 2"); other routing engines are also available and have
more bells and whistles.
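
A minimal sketch, assuming the stock OpenSM shipped with your distribution
(untested as written, adjust to taste):

opensm --lmc 2     # or set "lmc 2" in /etc/opensm/opensm.conf and restart the service

Note that the IPoIB bond itself has to be active-backup (mode 1); the other
bonding modes don't work on IPoIB.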


>I've never find a way to aggragate multiple ports.
Re-read that OSPF thread...
IB has means to do this, alas IPoIB bonding only supports failover at this
time, yes.

> The real drawback with IB is that I have to add IB cards on each compute
> nodes, where my current compute nodes a 2 10GBaseT ports onboard.
> 
> This add some costs
> 
Then look at the 10GigE options I listed.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Day Switzerland slides and video

2016-06-16 Thread Gregory Farnum
On Wed, Jun 15, 2016 at 11:30 AM, Dan van der Ster  wrote:
> Dear Ceph Community,
>
> Yesterday we had the pleasure of hosting Ceph Day Switzerland, and we
> wanted to let you know that the slides and videos of most talks have
> been posted online:
>
>   https://indico.cern.ch/event/542464/timetable/
>
> Thanks again to all the speakers and attendees!

These are really useful; thanks for sharing!
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs not coming up on one host

2016-06-16 Thread Gregory Farnum
On Wed, Jun 15, 2016 at 10:21 AM, Kostis Fardelas  wrote:
> Hello Jacob, Gregory,
>
> did you manage to start up those OSDs at last? I came across a very
> much alike incident [1] (no flags preventing the OSDs from getting UP
> in the cluster though, no hardware problems reported) and I wonder if
> you found out what was the culprit in your case.
>
> [1] http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/30432

Nope, never heard back. That said, it's not clear from your
description if these are actually the same problem; if they are you
need to provide monitor logs before anybody can help. If they aren't,
you are skipping steps and need to include OSD logs and things. ;)
-Greg

>
> Best regards,
> Kostis
>
> On 17 April 2015 at 02:04, Gregory Farnum  wrote:
>> The monitor looks like it's not generating a new OSDMap including the
>> booting OSDs. I could say with more certainty what's going on with the
>> monitor log file, but I'm betting you've got one of the noin or noup
>> family of flags set. I *think* these will be output in "ceph -w" or in
>> "ceph osd dump", although I can't say for certain in Firefly.
>> -Greg
>>
>> On Fri, Apr 10, 2015 at 1:57 AM, Jacob Reid  
>> wrote:
>>> On Fri, Apr 10, 2015 at 09:55:20AM +0100, Jacob Reid wrote:
 On Thu, Apr 09, 2015 at 05:21:47PM +0100, Jacob Reid wrote:
 > On Thu, Apr 09, 2015 at 08:46:07AM -0700, Gregory Farnum wrote:
 > > On Thu, Apr 9, 2015 at 8:14 AM, Jacob Reid 
 > >  wrote:
 > > > On Thu, Apr 09, 2015 at 06:43:45AM -0700, Gregory Farnum wrote:
 > > >> You can turn up debugging ("debug osd = 10" and "debug filestore = 
 > > >> 10"
 > > >> are probably enough, or maybe 20 each) and see what comes out to get
 > > >> more information about why the threads are stuck.
 > > >>
 > > >> But just from the log my answer is the same as before, and now I 
 > > >> don't
 > > >> trust that controller (or maybe its disks), regardless of what it's
 > > >> admitting to. ;)
 > > >> -Greg
 > > >>
 > > >
 > > > Ran with osd and filestore debug both at 20; still nothing jumping 
 > > > out at me. Logfile attached as it got huge fairly quickly, but 
 > > > mostly seems to be the same extra lines. I tried running some test 
 > > > I/O on the drives in question to try and provoke some kind of 
 > > > problem, but they seem fine now...
 > >
 > > Okay, this is strange. Something very wonky is happening with your
 > > scheduler — it looks like these threads are all idle, and they're
 > > scheduling wakeups that handle an appreciable amount of time after
 > > they're supposed to. For instance:
 > > 2015-04-09 15:56:55.953116 7f70a7963700 20
 > > filestore(/var/lib/ceph/osd/osd.15) sync_entry woke after 5.416704
 > > 2015-04-09 15:56:55.953153 7f70a7963700 20
 > > filestore(/var/lib/ceph/osd/osd.15) sync_entry waiting for
 > > max_interval 5.00
 > >
 > > This is the thread that syncs your backing store, and it always sets
 > > itself to get woken up at 5-second intervals — but here it took >5.4
 > > seconds, and later on in your log it takes more than 6 seconds.
 > > It looks like all the threads which are getting timed out are also
 > > idle, but are taking so much longer to wake up than they're set for
 > > that they get a timeout warning.
 > >
 > > There might be some bugs in here where we're expecting wakeups to be
 > > more precise than they can be, but these sorts of misses are
 > > definitely not normal. Is this server overloaded on the CPU? Have you
 > > done something to make the scheduler or wakeups wonky?
 > > -Greg
 >
 > CPU load is minimal - the host does nothing but run OSDs and has 8 cores 
 > that are all sitting idle with a load average of 0.1. I haven't done 
 > anything to scheduling. That was with the debug logging on, if that 
 > could be the cause of any delays. A scheduler issue seems possible - I 
 > haven't done anything to it, but `time sleep 5` run a few times returns 
 > anything spread randomly from 5.002 to 7.1(!) seconds but mostly in the 
 > 5.5-6.0 region where it managed fairly consistently <5.2 on the other 
 > servers in the cluster and <5.02 on my desktop. I have disabled the CPU 
 > power saving mode as the only thing I could think of that might be 
 > having an effect on this, and running the same test again gives more 
 > sane results... we'll see if this reflects in the OSD logs or not, I 
 > guess. If this is the cause, it's probably something that the next 
 > version might want to make a specific warning case of detecting. I will 
 > keep you updated as to their behaviour now...
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 Overnight, nothing chang

Re: [ceph-users] Ceph file change monitor

2016-06-16 Thread Gregory Farnum
On Wed, Jun 15, 2016 at 5:19 AM, siva kumar <85s...@gmail.com> wrote:
> Yes, we need something similar to inotify/fanotify.
>
> I came across this link:
> http://docs.ceph.com/docs/master/dev/osd_internals/watch_notify/?highlight=notify#watch-notify
>
> I just want to know if I can use this?
>
> If yes, how should we use it?

Not really. Watch-notify is a feature of RADOS and requires
registering explicit watches on each object you'd like to see. inotify
for CephFS would presumably be implemented in the MDS (mostly) and
watch-notify is unlikely to be one of its building blocks.

I think this would be an interesting project for somebody, but it's
not a trivial one and depending on exactly what events inotify offers
it could get very complicated. (For instance, doing it properly would
probably require cooperative clients notifying the MDS when they've
performed certain actions.)
-Greg

>
> Thanks,
> Siva
>
> On Thu, Jun 9, 2016 at 6:06 PM, Anand Bhat  wrote:
>>
>> I think you are looking for inotify/fanotify events for Ceph. Usually
>> these are implemented for local file system. Ceph being a networked file
>> system, it will not be easy to implement  and will involve network traffic
>> to generate events.
>>
>> Not sure it is in the plan though.
>>
>> Regards,
>> Anand
>>
>> On Wed, Jun 8, 2016 at 2:46 PM, John Spray  wrote:
>>>
>>> On Wed, Jun 8, 2016 at 8:40 AM, siva kumar <85s...@gmail.com> wrote:
>>> > Dear Team,
>>> >
>>> > We are using ceph storage & cephFS for mounting .
>>> >
>>> > Our configuration :
>>> >
>>> > 3 osd
>>> > 3 monitor
>>> > 4 clients .
>>> > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>>> >
>>> > We would like to get file change notifications like what is the event
>>> > (ADDED, MODIFIED,DELETED) and for which file the event has occurred.
>>> > These
>>> > notifications should be sent to our server.
>>> > How to get these notifications?
>>>
>>> This isn't a feature that CephFS has right now.  Still, I would be
>>> interested to know what protocol/format your server would consume
>>> these kinds of notifications in?
>>>
>>> John
>>>
>>> > Ultimately we would like to add our custom file watch notification
>>> > hooks to
>>> > ceph so that we can handle this notifications by our self .
>>> >
>>> > Additional Info :
>>> >
>>> > [test@ceph-zclient1 ~]$ ceph -s
>>> >
>>> >> cluster a8c92ae6-6842-4fa2-bfc9-8cdefd28df5c
>>> >
>>> >  health HEALTH_WARN
>>> > mds0: ceph-client1 failing to respond to cache pressure
>>> > mds0: ceph-client2 failing to respond to cache pressure
>>> > mds0: ceph-client3 failing to respond to cache pressure
>>> > mds0: ceph-client4 failing to respond to cache pressure
>>> >  monmap e1: 3 mons at
>>> >
>>> > {ceph-zadmin=xxx.xxx.xxx.xxx:6789/0,ceph-zmonitor=xxx.xxx.xxx.xxx:6789/0,ceph-zmonitor1=xxx.xxx.xxx.xxx:6789/0}
>>> > election epoch 16, quorum 0,1,2
>>> > ceph-zadmin,ceph-zmonitor1,ceph-zmonitor
>>> >  mdsmap e52184: 1/1/1 up {0=ceph-zstorage1=up:active}
>>> >  osdmap e3278: 3 osds: 3 up, 3 in
>>> >   pgmap v5068139: 384 pgs, 3 pools, 518 GB data, 7386 kobjects
>>> > 1149 GB used, 5353 GB / 6503 GB avail
>>> >  384 active+clean
>>> >
>>> >   client io 1259 B/s rd, 179 kB/s wr, 11 op/s
>>> >
>>> >
>>> >
>>> > Thanks,
>>> > S.Sivakumar
>>> >
>>> >
>>> >
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> --
>>
>> 
>> Never say never.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-16 Thread Blair Bethwaite
Hi Wade,

What IO are you seeing on the OSD devices when this happens (see e.g.
iostat), are there short periods of high read IOPS where (almost) no
writes occur? What does your memory usage look like (including slab)?
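
For example (assuming sysstat is installed), something like:

iostat -x 5
slabtop -o | head -25
grep -i slab /proc/meminfo

run while the slowdown is happening should be enough to see it.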

Cheers,

On 16 June 2016 at 22:14, Wade Holler  wrote:
> Hi All,
>
> I have a repeatable condition when the object count in a pool gets to
> 320-330 million the object write time dramatically and almost
> instantly increases as much as 10X, exhibited by fs_apply_latency
> going from 10ms to 100s of ms.
>
> Can someone point me in a direction / have an explanation ?
>
> I can add a new pool and it performs normally.
>
> Config is generally
> 3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD
> with NVME for journals. Centos 7.2, XFS
>
> Jewell is the release; inserting objects with librados via some Python
> test code.
>
> Best Regards
> Wade
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I make daemon for ceph-dash

2016-06-16 Thread Kanchana. P
Hi,

When a rgw service is started, by default below pools are created.
.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log

When a swift user is created, some default pools are created, but I would like
to use "Pool_A" for the swift user.
From the client, when I run COSBench the data should be placed in "Pool_A"
instead of the default pools. How can I achieve that?

I also need help on how to run COSBench with a swift user. Your help is very
much appreciated.

​Thanks,
kanchana.​
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance drop when object count in a pool hits a threshold

2016-06-16 Thread Wade Holler
Hi All,

I have a repeatable condition: when the object count in a pool gets to
320-330 million, the object write time increases dramatically and almost
instantly, by as much as 10X, exhibited by fs_apply_latency going from 10ms to
100s of ms.

Can someone point me in a direction / have an explanation ?

I can add a new pool and it performs normally.

Config is generally
3 Nodes 24 physical core each, 768GB Ram each, 16 OSD / node , all SSD with
NVME for journals. Centos 7.2, XFS

Jewell is the release; inserting objects with librados via some Python test
code.

Best Regards
Wade
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v10.2.2 Jewel released

2016-06-16 Thread Oliver Dzombic
Hi,

thank you for the release !


http://docs.ceph.com/docs/master/_downloads/v10.2.2.txt
->
404 Not Found
nginx/1.4.6 (Ubuntu)

http://docs.ceph.com/docs/master/release-notes/#v10-2-2-jewel
links
http://docs.ceph.com/docs/master/_downloads/v10.2.1.txt

For me, the more detailed changelog would be very interesting.

Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 15.06.2016 um 21:10 schrieb Sage Weil:
> This point release fixes several important bugs in RBD mirroring, RGW 
> multi-site, CephFS, and RADOS.
> 
> We recommend that all v10.2.x users upgrade.
> 
> For more detailed information, see the release notes at
> 
> http://docs.ceph.com/docs/master/release-notes/#v10-2-2-jewel
> 
> or the complete changelog at
> 
> http://docs.ceph.com/docs/master/_downloads/v10.2.2.txt
> 
> Getting Ceph
> 
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-10.2.2.tar.gz
> * For packages, see http://ceph.com/docs/master/install/get-packages
> * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switches and latency

2016-06-16 Thread Oliver Dzombic
Hi,

aside from the question of the coolness factor of InfiniBand,
you should always also consider the question of replacing parts and
extending cluster.

A 10G Network environment is up to date currently, and will be for some
more years. You can easily get equipment for it, and the pricing gets
lower and lower. Also you can use that network environment also for
other stuff ( if needed ) just to keep flexibility.

With the IB stuff, you can only use it for one purpose. And you have a (
very ) limited choice of options to get new parts.

So, from the point of view of flexibility and the cost/benefit ratio, I don't see
where IB will do a good job for you in the long run.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 16.06.2016 um 12:44 schrieb Gandalf Corvotempesta:
> 2016-06-16 3:53 GMT+02:00 Christian Balzer :
>> Gandalf, first read:
>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg29546.html
>>
>> And this thread by Nick:
>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg29708.html
> 
> Interesting reading. Thanks.
> 
>> Overly optimistic.
>> In an idle cluster with synthetic tests you might get sequential reads
>> that are around 150MB/s per HDD.
>> As for writes, think 80MB/s, again in an idle cluster.
>>
>> Any realistic, random I/O and you're looking at 50MB/s at most either way.
>>
>> So your storage nodes can't really saturate even a single 10Gb/s link in
>> real life situations.
> 
> Ok.
> 
>> Journal SSDs can improve on things, but that's mostly for IOPS.
>> In fact they easily become the bottleneck bandwidth wise and are so on
>> most of my storage nodes.
>> Because you'd need at least 2 400GB DC S3710 SSDs to get around 1GB/s
>> writes, or one link worth.
> 
> I plan to use 1 or 2 SSD journal (probably, 1 SSD every 6 spinning disks)
> 
>> Splitting things in cluster and public networks ONLY makes sense when your
>> storage node can saturate ALL the network bandwidth, which usually is only
>> the case when it comes to very expensive SSD/NVMe only nodes.
> 
> This is not my case.
> 
>> Going back to your original post, with a split network the latency in both
>> networks counts the same, as a client write will NOT be acknowledged until
>> it has reach the journal of all replicas, so having a higher latency
>> cluster network is counterproductive.
> 
> Ok.
> 
>> Or if you can start with a clean slate (including the clients), look at
>> Infiniband.
>> All my production clusters are running entirely IB (IPoIB currently) and
>> I'm very happy with the performance, latency and cost.
> 
> Yes, i'll start with a brand new network.
> Acutally i'm testing with some old IB switches (DDR) and i'm not very
> happy, as IPoIB doesn't go over 8/9Gbit/s in a DDR. Additionally, CX4
> cables used by DDR are... HUGE and very "hard" to bend in the rack.
> I don't know if QDR cables are thinner.
> 
> Are you using QDR? I've seen a couple of mellanox used switches on ebay
> that seems to be ok for me. 36 QDR ports would be awesome but I don't
> have any IB knowledge.
> Could I keep the IB fabric unconfigured and use only IPoIB ?
> I can create a bonded (failover) IPoIB device on each node and add 2 or more
> IB cables between both switches. In a normal Ethernet network, these 2 cables
> must be joined in a LAG to avoid loops. Is infiniband able to manage
> this on their
> own ? I've never find a way to aggragate multiple ports.
> 
> The real drawback with IB is that I have to add IB cards on each compute 
> nodes,
> where my current compute nodes a 2 10GBaseT ports onboard.
> 
> This add some costs
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switches and latency

2016-06-16 Thread Gandalf Corvotempesta
2016-06-16 3:53 GMT+02:00 Christian Balzer :
> Gandalf, first read:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg29546.html
>
> And this thread by Nick:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg29708.html

Interesting reading. Thanks.

> Overly optimistic.
> In an idle cluster with synthetic tests you might get sequential reads
> that are around 150MB/s per HDD.
> As for writes, think 80MB/s, again in an idle cluster.
>
> Any realistic, random I/O and you're looking at 50MB/s at most either way.
>
> So your storage nodes can't really saturate even a single 10Gb/s link in
> real life situations.

Ok.

> Journal SSDs can improve on things, but that's mostly for IOPS.
> In fact they easily become the bottleneck bandwidth wise and are so on
> most of my storage nodes.
> Because you'd need at least 2 400GB DC S3710 SSDs to get around 1GB/s
> writes, or one link worth.

I plan to use 1 or 2 SSD journal (probably, 1 SSD every 6 spinning disks)

> Splitting things in cluster and public networks ONLY makes sense when your
> storage node can saturate ALL the network bandwidth, which usually is only
> the case when it comes to very expensive SSD/NVMe only nodes.

This is not my case.

> Going back to your original post, with a split network the latency in both
> networks counts the same, as a client write will NOT be acknowledged until
> it has reach the journal of all replicas, so having a higher latency
> cluster network is counterproductive.

Ok.

> Or if you can start with a clean slate (including the clients), look at
> Infiniband.
> All my production clusters are running entirely IB (IPoIB currently) and
> I'm very happy with the performance, latency and cost.

Yes, I'll start with a brand new network.
Actually I'm testing with some old IB switches (DDR) and I'm not very
happy, as IPoIB doesn't go over 8/9 Gbit/s on DDR. Additionally, the CX4
cables used by DDR are... HUGE and very "hard" to bend in the rack.
I don't know if QDR cables are thinner.

Are you using QDR? I've seen a couple of used Mellanox switches on eBay
that seem to be OK for me. 36 QDR ports would be awesome, but I don't
have any IB knowledge.
Could I keep the IB fabric unconfigured and use only IPoIB?
I could create a bonded (failover) IPoIB device on each node and add 2 or more
IB cables between both switches. In a normal Ethernet network, these 2 cables
must be joined in a LAG to avoid loops. Is InfiniBand able to manage
this on its own? I've never found a way to aggregate multiple ports.

The real drawback with IB is that I would have to add IB cards to each compute node,
whereas my current compute nodes have 2 10GBaseT ports onboard.

This adds some cost.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs stuck in booting state after redeploying

2016-06-16 Thread Kostis Fardelas
Answering myself, for whoever may be interested. After some strace
and a closer look at the logs, I realized that the cluster knew different
fsids for my redeployed OSDs, which means I had not run 'ceph osd rm' on
them before re-adding them to the cluster.

So the fact is that Ceph does not update the OSD fsids of redeployed
OSDs, even after the old ones have been removed from the crushmap. You need to
rm them first.
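
For the record, the full removal sequence that avoids this (sketched here for
osd.12, the one from my status output below):

ceph osd out 12
(stop the ceph-osd daemon)
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12

and only then re-create and re-add the OSD.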

Regards,
Kostis

On 15 June 2016 at 17:14, Kostis Fardelas  wrote:
> Hello,
> in the process of redeploying some OSDs in our cluster, after
> destroying one of them (down, out, remove from crushmap) and trying to
> redeploy it (crush add ,start), we reach a state where the OSD gets
> stuck at booting state:
> root@staging-rd0-02:~# ceph daemon osd.12 status
> { "cluster_fsid": "XXX",
>   "osd_fsid": "XX",
>   "whoami": 12,
>   "state": "booting",
>   "oldest_map": 150201,
>   "newest_map": 150779,
>   "num_pgs": 0}
>
> No flags that could prevent the OSD to get up is in place. The OSD
> never gets marked as up in 'ceph osd tree' and never gets in. If I try
> to manual get it in, it gets out after a while. The cluster OSD map
> keeps going forward, but the OSD cannot catch-up of course. I started
> the OSD with debugging options:
> debug osd = 20
> debug filestore = 20
> debug journal = 20
> debug monc = 20
> debug ms = 1
>
> and what I see is contiuning OSD logs of this kind:
> 2016-06-15 16:39:33.876339 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:33.876343 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:34.390560 7f022e2ee700 20 osd.12 150798
> update_osd_stat osd_stat(59384 kB used, 558 GB avail, 558 GB total,
> peers []/[] op hist [])
> 2016-06-15 16:39:34.390622 7f022e2ee700  5 osd.12 150798 heartbeat:
> osd_stat(59384 kB used, 558 GB avail, 558 GB total, peers []/[] op
> hist [])
> 2016-06-15 16:39:34.876526 7f0256b61700  5 osd.12 150798 tick
> 2016-06-15 16:39:34.876561 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:34.876565 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:35.876729 7f0256b61700  5 osd.12 150798 tick
> 2016-06-15 16:39:35.876762 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:35.876766 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:36.646355 7f025535e700 20
> filestore(/rados/staging-rd0-02-12) sync_entry woke after 30.000161
> 2016-06-15 16:39:36.646421 7f025535e700 20
> filestore(/rados/staging-rd0-02-12) sync_entry waiting for
> max_interval 30.00
> 2016-06-15 16:39:36.876917 7f0256b61700  5 osd.12 150798 tick
> 2016-06-15 16:39:36.876949 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:36.876953 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:37.877112 7f0256b61700  5 osd.12 150798 tick
> 2016-06-15 16:39:37.877142 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:37.877147 7f0256b61700 10 osd.12 150798 do_waiters -- finish
> 2016-06-15 16:39:38.877298 7f0256b61700  5 osd.12 150798 tick
> 2016-06-15 16:39:38.877327 7f0256b61700 10 osd.12 150798 do_waiters -- start
> 2016-06-15 16:39:38.877331 7f0256b61700 10 osd.12 150798 do_waiters -- finish
>
> Is there a solution for this problem? Known bug? We are on firefly
> (0.80.11) and wanted to do some maintenance before going to hammer,
> but now we are somewhat stuck.
>
> Best regards,
> Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] strange behavior using resize2fs vm image on rbd pool

2016-06-16 Thread ceph
resize2fs works somewhat incrementally, I guess.

You may notice that on a slow system, if you add a lot of space to a
large partition.

Running resize2fs in one screen, and watching df -h in another, will show
you an incremental increase in disk space.


Maybe the discard option can help you in that case, if it's really an
issue, and if your software supports it.

man ext4 said:
>discard/nodiscard
>   Controls  whether ext4 should issue discard/TRIM commands to the
>   underlying block device when blocks are freed.  This  is  useful
>   for  SSD  devices  and sparse/thinly-provisioned LUNs, but it is
>   off by default until sufficient testing has been done.
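
For example (a sketch; inside a VM the discards also have to be passed through
to the rbd device, e.g. virtio-scsi with discard=unmap):

mount -o discard /dev/vdb1 /mnt/data
fstrim -v /mnt/data        # the on-demand alternative to the mount option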




On 16/06/2016 12:24, Zhongyan Gu wrote:
> Hi,
> it seems using resize2fs on rbd image would generate lots of garbage
> objects in ceph.
> The experiment is:
> 1. use resize2fs to extent 50G rbd image A to 400G image with ext4 format
> in vm.
> 2. calculate the total object size in rbd pool, 35GB(already divided by
> replicas#).
> 3. cone ImageB based on 400G image A. then flatten Image B.
> 4. after flatten, calculate the total object size in rbd pool and Image B's
> actual size is 14GB.
> 
> I'm confused why Image B size is 14GB, not the same as Image A.
> The only possible way that can explain that is resize2fs generate a lot of
> garbage objects in rbd. And flatten ignored those garbage files.
> Anyone can help me confirm this??
> 
> 
> thanks
> Cory
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] strange behavior using resize2fs vm image on rbd pool

2016-06-16 Thread Zhongyan Gu
Hi,
it seems that using resize2fs on an rbd image generates lots of garbage
objects in Ceph.
The experiment is:
1. Use resize2fs to extend a 50G rbd image A to a 400G image with ext4 format
in a VM.
2. Calculate the total object size in the rbd pool: 35GB (already divided by
the number of replicas).
3. Clone Image B based on the 400G image A, then flatten Image B.
4. After flattening, calculate the total object size in the rbd pool; Image B's
actual size is 14GB.

I'm confused why Image B's size is 14GB and not the same as Image A's.
The only explanation I can see is that resize2fs generates a lot of garbage
objects in rbd, and flatten ignores those garbage objects.
Can anyone help me confirm this?


thanks
Cory
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Stripe/Chunk Size (Order Number) Pros Cons

2016-06-16 Thread Mark Nelson



On 06/16/2016 03:54 AM, Mark Nelson wrote:

Hi,

larger stripe size (to an extent) will generally improve large
sequential read and write performance.


Oops, I should have had my coffee. I missed a sentence here: larger 
stripe size will generally improve large sequential read and write 
performance. Smaller stripe size can provide some of the advantages you 
mention below, but there's overhead. OK, fixed; now back to finding 
coffee. :)



There's overhead though.  It
means more objects which can slow things down at the filestore level
when PG splits occur and also potentially means more inodes fall out of
cache, longer syncfs, etc.  On the other hand, if using cache tiering,
smaller objects means less data to promote which can be a big win for
small IO.

Basically the answer is that there are pluses and minuses, and the exact
behavior will depend on your kernel configuration, hardware, and use
case.  I think 4MB has been a fairly good default thus far (might change
with bluestore), but tuning for a specific use case may mean a smaller
or larger size is better.

Mark

On 06/16/2016 03:20 AM, Lazuardi Nasution wrote:

Hi,

I'm looking for some pros cons related to RBD stripe/chunk size
indicated by image order number. Default is 4MB (order 22), but
OpenStack use 8MB (order 23) as default. What if we use smaller size
(lower order number), isn't it more chance that image objects is
spreaded through OSDs and cached on OSD nodes RAM? What if we use bigger
size (higher order number), isn't it more chance that image objects is
cached as contiguos blocks and may be have read ahead advantage? Please
give your opnion and reason.

Best regards,



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Stripe/Chunk Size (Order Number) Pros Cons

2016-06-16 Thread Mark Nelson

Hi,

larger stripe size (to an extent) will generally improve large 
sequential read and write performance.  There's overhead though.  It 
means more objects which can slow things down at the filestore level 
when PG splits occur and also potentially means more inodes fall out of 
cache, longer syncfs, etc.  On the other hand, if using cache tiering, 
smaller objects means less data to promote which can be a big win for 
small IO.


Basically the answer is that there are pluses and minuses, and the exact 
behavior will depend on your kernel configuration, hardware, and use 
case.  I think 4MB has been a fairly good default thus far (might change 
with bluestore), but tuning for a specific use case may mean a smaller 
or larger size is better.
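
For reference, picking a non-default object size happens at image creation time; 
a sketch assuming a Jewel-era rbd CLI and a pool named "rbd":

rbd create rbd/testimg --size 10240 --order 23    # 8 MiB objects instead of the 4 MiB default
rbd info rbd/testimg                              # "order 23" should be reported here

(OpenStack's 8MB default comes from the client side, e.g. rbd_store_chunk_size 
in Glance, if I remember correctly.)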


Mark

On 06/16/2016 03:20 AM, Lazuardi Nasution wrote:

Hi,

I'm looking for some pros cons related to RBD stripe/chunk size
indicated by image order number. Default is 4MB (order 22), but
OpenStack use 8MB (order 23) as default. What if we use smaller size
(lower order number), isn't it more chance that image objects is
spreaded through OSDs and cached on OSD nodes RAM? What if we use bigger
size (higher order number), isn't it more chance that image objects is
cached as contiguos blocks and may be have read ahead advantage? Please
give your opnion and reason.

Best regards,



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD Stripe/Chunk Size (Order Number) Pros Cons

2016-06-16 Thread Lazuardi Nasution
Hi,

I'm looking for some pros and cons related to the RBD stripe/chunk size indicated
by the image order number. The default is 4MB (order 22), but OpenStack uses 8MB
(order 23) as its default. If we use a smaller size (lower order number), isn't
there a better chance that image objects are spread across OSDs and cached in the
OSD nodes' RAM? If we use a bigger size (higher order number), isn't there a
better chance that image objects are cached as contiguous blocks and may have a
read-ahead advantage? Please give your opinion and reasoning.

Best regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com