Re: [ceph-users] High 0.94.5 OSD memory use at 8GB RAM/TB raw disk during recovery

2015-12-01 Thread Ross Annetts

Hi,

It may be a bit late now, but if you are adding additional capacity I
would recommend adding one OSD at a time. If you want even more control,
add the OSD with a CRUSH weight of 0 and increase it gradually. This limits
the amount of data that needs to be moved and the resources required.
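
For example, one possible sequence looks like this (the OSD id, host bucket and
weight steps are purely illustrative; wait for HEALTH_OK between steps):

ceph osd crush add osd.48 0 host=newhost
ceph osd crush reweight osd.48 0.5
ceph osd crush reweight osd.48 1.0
(and so on until the weight matches the drive size, e.g. 2.7 for a 3 TB disk)

Each small weight increase only moves a small slice of PGs, so recovery traffic
and memory use stay bounded.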


Regards,
Ross

On 1/12/2015 11:59 AM, Mark Nelson wrote:

Oh, forgot to ask, any core dumps?

Mark

On 11/30/2015 06:58 PM, Mark Nelson wrote:

Hi Laurent,

Wow, that's excessive!  I'd see if anyone else has any tricks first, but
if nothing else helps, running an OSD through valgrind with massif will
probably help pinpoint what's going on.  Have you tweaked the recovery
tunables at all?
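
For reference, a massif run usually looks something like this (the OSD id is
illustrative; the OSD has to be run in the foreground under valgrind rather
than via the init scripts, and it will be noticeably slower):

valgrind --tool=massif /usr/bin/ceph-osd --cluster=ceph -i 41 -f
ms_print massif.out.<pid>

The ms_print output shows which call paths account for the heap growth.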

Mark

On 11/30/2015 06:52 PM, Laurent GUERBY wrote:

Hi,

We lost a disk today in our ceph cluster, so we added a new machine with
4 disks to replace the capacity, and we also activated the straw1 tunable
(we tried straw2 too but quickly backed out that change).

During recovery, OSDs started crashing on all of our machines,
the issue being OSD RAM usage that goes very high, e.g.:

24078 root  20   0 27.784g 0.026t  10888 S   5.9 84.9
16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
/dev/sda1   2.7T  2.2T  514G  82% /var/lib/ceph/osd/ceph-41

That's about 8 GB of resident RAM per TB of disk, way above
what we provisioned (~2-4 GB RAM/TB).

We rebuilt 0.94.5 with the three memory-related commits below, but
it didn't change anything.

Right now our cluster is unable to fully restart and recover with the
machines and RAM we have been working with for the past year.

Any idea on what to look for?

Thanks in advance,

Sincerely,

Laurent

commit 296bec72649884447b59e785c345c53994df9e09
Author: xiexingguo <258156...@qq.com>
Date:   Mon Oct 26 18:38:01 2015 +0800

 FileStore: potential memory leak if _fgetattrs fails

 Memory leak happens if _fgetattrs encounters some error and simply returns.
 Fixes: #13597
 Signed-off-by: xie xingguo 

 (cherry picked from commit ace7dd096b58a88e25ce16f011aed09269f2a2b4)


commit 16aa14ab0208df568e64e2a4f7fe7692eaf6b469
Author: Xinze Chi 
Date:   Sun Aug 2 18:36:40 2015 +0800

 bug fix: osd: do not cache unused buffer in attrs

 attrs only reference the origin bufferlist (decode from MOSDPGPush or
 ECSubReadReply message) whose size is much greater than attrs in recovery.
 If obc cache it (get_obc maybe cache the attr), this causes the whole origin
 bufferlist would not be free until obc is evicted from obc cache. So rebuild
 the bufferlist before cache it.

 Fixes: #12565
 Signed-off-by: Ning Yao 
 Signed-off-by: Xinze Chi 
 (cherry picked from commit c5895d3fad9da0ab7f05f134c49e22795d5c61f3)


commit 51ea1ca7f4a7763bfeb110957cd8a6f33b8a1422
Author: xiexingguo <258156...@qq.com>
Date:   Thu Oct 29 20:04:11 2015 +0800

 Objecter: pool_op callback may hang forever.

 pool_op callback may hang forever due to osdmap update during reply handling.
 Fixes: #13642
 Signed-off-by: xie xingguo 

 (cherry picked from commit 00c6fa9e31975a935ed2bb33a099e2b4f02ad7f2)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[ceph-users] Infernalis for Debian 8 armhf

2015-12-01 Thread Swapnil Jain

Hi,

Are there any plans to release Infernalis Debian 8 binary packages for armhf? I only
see them for amd64.



—

Swapnil Jain


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of OSD map versions

2015-12-01 Thread George Mihaiescu
Thanks Dan,

I'll use these ones from Infernalis:


[global]
osd map message max = 100

[osd]
osd map cache size = 200
osd map max advance = 150
osd map share max epochs = 100
osd pg epoch persisted max stale = 150


George

On Mon, Nov 30, 2015 at 4:20 PM, Dan van der Ster 
wrote:

> I wouldn't run with those settings in production. That was a test to
> squeeze too many OSDs into too little RAM.
>
> Check the values from infernalis/master. Those should be safe.
>
> --
> Dan
> On 30 Nov 2015 21:45, "George Mihaiescu"  wrote:
>
>> Hi,
>>
>> I've read the recommendation from CERN about the number of OSD maps (
>> https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf,
>> page 3) and I would like to know if there is any negative impact from these
>> changes:
>>
>> [global]
>> osd map message max = 10
>>
>> [osd]
>> osd map cache size = 20
>> osd map max advance = 10
>> osd map share max epochs = 10
>> osd pg epoch persisted max stale = 10
>>
>>
>> We are running Hammer with nowhere close to 7000 OSDs, but I don't want
>> to waste memory on OSD maps which are not needed.
>>
>> Are there any large production deployments running with these or similar
>> settings?
>>
>> Thank you,
>> George
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High 0.94.5 OSD memory use at 8GB RAM/TB raw disk during recovery

2015-12-01 Thread Laurent GUERBY
On Tue, 2015-12-01 at 13:51 -0600, Ryan Tokarek wrote:
> > On Nov 30, 2015, at 6:52 PM, Laurent GUERBY  wrote:
> > 
> > Hi,
> > 
> > We lost a disk today in our ceph cluster so we added a new machine with
> > 4 disks to replace the capacity and we activated straw1 tunable too
> > (we also tried straw2 but we quickly backed out this change).
> > 
> > During recovery OSD started crashing on all of our machines
> > the issue being OSD RAM usage that goes very high, eg:
> > 
> > 24078 root  20   0 27.784g 0.026t  10888 S   5.9 84.9
> > 16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
> > /dev/sda1   2.7T  2.2T  514G  82% /var/lib/ceph/osd/ceph-41
> > 
> > That's about 8GB resident RAM per TB of disk, way above
> > what we provisioned ~ 2-4 GB RAM/TB.
> 
> We had something vaguely similar (not nearly that dramatic though!) happen to 
> us. During a recovery (actually, I think this was rebalancing after upgrading 
> from an earlier version of ceph), our OSDs took so much memory they would get 
> killed by oom_killer and we couldn't keep the cluster up long enough to get 
> back to healthy. 
> 
> A solution for us was to enable zswap; previously we had been running with no 
> swap at all. 
> 
> If you are running a kernel newer than 3.11 (you might want more recent than 
> that as I believe there were major fixes after 3.17), then enabling zswap 
> allows the kernel to compress pages in memory before needing to touch disk. 
> The default max pool size for this is 20% of memory. There is extra CPU time 
> to compress/decompress, but it's much faster than going to disk, and the OSD 
> data appears to be quite compressible. For us, nothing actually made it to 
> the disk, but a swapfile must be enabled for zswap to do its work.
> 
> https://www.kernel.org/doc/Documentation/vm/zswap.txt
> http://askubuntu.com/questions/471912/zram-vs-zswap-vs-zcache-ultimate-guide-when-to-use-which-one
> 
> Add "zswap.enabled=1" to your kernel bool parameters and reboot. 
> 
> If you have no swap file/partition/disk/whatever, then you need one for zswap 
> to actually do anything. Here is an example, but use whatever sizes, 
> locations, process you prefer:
> 
> dd if=/dev/zero of=/var/swap bs=1M count=8192
> chmod 600 /var/swap
> mkswap /var/swap
> swapon /var/swap
> 
> Consider adding it to /etc/fstab:
> /var/swap swapswapdefaults 0 0 
> 
> This got us through the rebalancing. The OSDs eventually returned to normal, 
> but we've just left zswap enabled with no apparent problems. I don't know 
> that it will be enough for your situation, but it might help. 
> 
> Ryan

Hi Ryan,

Thanks for your suggestion!

We also managed to recover the cluster after about 15 hours of trying. 

We added a 64G swapfile to each host (placed on the OSD disks...), enabled
noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub,notieragent,
stopped all ceph clients, manually stopped for a few hours the OSDs that
restarted too often (that action seemed to be the one that helped the
most to stabilize things), periodically sent "ceph tell osd.N heap
release", and manually restarted any suspicious OSD (slow requests,
infinite "currently waiting for rw locks", or excessive RAM use).

On the "waiting for rw locks" may be backporting to 0.94.6
http://tracker.ceph.com/issues/13821
would help.

Loic, is there a test for a cluster where you drive the OSDs
near the host's maximum RAM (e.g. lots of small objects per PG, a small
amount of memory on the node), then kill one third of the OSDs and check
that recovery onto the two surviving thirds completes without OOM? The next
step would be to periodically stop, wait, and restart a given number of
OSDs and see whether things stabilize RAM-wise.

Sincerely,

Laurent

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on a partition

2015-12-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

The typecode allows udev to automatically start the OSD when it is
detected. From what I remember, ceph-disk will prepare the partition
regardless of whether the typecode is there, but it has been so long since
I've tried. I've used this on Hammer as recently as two weeks ago.
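
If in doubt whether the typecode took, something like this prints the
partition type GUID so you can compare it against the ceph journal/data GUIDs
above (partition number and device are just examples):

sgdisk -i 1 /dev/sdc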
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Dec 1, 2015 at 3:06 PM, Marek Dohojda  wrote:
>
> Didn't mean to send the previous email; I apologize for the spam.
>
> Anyway Thank you, I will give that a shot.
>
> Whenever I tried to use ceph-disk prepare on a partition I got the OSD
> complaining that this isn't a block device.  Will the sgdisk typecode fix
> that?
>
>
>
>
>> On Dec 1, 2015, at 3:03 PM, Robert LeBlanc  wrote:
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> The documentation is a little sparse in this regard, here is what I use:
>>
>> sgdisk --new=1:0:+10240M --change-name=1:"ceph journal"
>> --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
>> sgdisk --new=2:0:0 --change-name=2:"ceph data"
>> --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdc
>>
>> This creates a partition that is 10 GB from the front of the drive as
>> a journal, then uses the rest of the drive for an OSD. You can then
>> use ceph-disk prepare /dev/sdc2 [/dev/sdc1].
>>
>> Adapt it to fit your needs.
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.3.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWXhkfCRDmVDuy+mK58QAAmEAP/2RtZfhDrnsS0ME3mya/
>> uCYhiXbEdTk+mhmJokC00bHwiZczIf11OUcHIZahvQ/Z/a5ojLIIJWnaSNbc
>> TMLLiHvfbsm8Rs8cMKzpX0NuGgJNefS3M7XpnacgC6ZmE31Rtnd1bi7ThqnK
>> CYzgyS/m5kjmRVwJNr76PpGx6tPiFC3oZgDesq0bm0T97RDjfyYXB3wxkVWY
>> V3u51m7CLa9y3rvNbdGmwiWoR6jhmFMic5tCLJYD6zKnvhhq6P6OLM4RA+6P
>> quSaFUAmZ/JrMWPEY3/B+lRx3j4kdXue2OJIgRQf7XiSJpeubFgVGtxSzBYz
>> OYsV09fOpS7dLojXtmsrQekSIQIGqy3PZMl/WfQVNdQ+etVOenR+8CBhTTst
>> or8fu+s8n+T9brcvFP2cfwickF5Rp+tVc3d238l+Kbc4t6SLtx71q5/AiQpR
>> 8mEOvRHlTTxoaozleepuw7xnymnNShFogwzCXYj7DoaBMxTT4igWfHwWb5/I
>> 0R5bYkheBkYxLlVaf7faUWcjySwunW1SY/rc2FkUFe52VlZ5cbFfJ+ym0an5
>> i5SdfLd0gk4zR5l35j7svdJZU9+QIZLcz/S12Nx5mwUxhnhEeqYMBS/ENSca
>> tKq4nlqyIGaCyDaLlcaECRLBjskrNRMeV7vnNUQ59BzJuMWOHhq571zHeXYO
>> tezS
>> =mxz9
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Dec 1, 2015 at 2:46 PM, Marek Dohojda
>>  wrote:
>>> Thus far parted isn’t very nice with ext4, I suppose I could try to do so
>>> and the worst that would happen is me loosing 1 OSD, however in my tests
>>> this wasn’t very reliable.
>>>
>>> Non GPT partition, utilizing fdisk I can do this without a problem, but OSD
>>> requires GPT (to the best of my knowledge anyway).
>>>
>>> Hence why I would like to know if there is a way for me to do a partition
>>> from the get go.  Since if the shrink doesn’t work my only other option is
>>> this.  Unless of course I create a directory on the OSD file system and
>>> simlink the Spindle Journal within that new directory something like
>>>
>>> ln -s /var/lib/ceph/osd-0/spin_journal/journal /var/lib/ceph/osd-2/journal
>>>
>>> I feel that this approach is not very clean though.
>>>
>>>
>>>
>>> On Dec 1, 2015, at 12:39 PM, Nick Fisk  wrote:
>>>
>>>
>>>
>>>
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Marek Dohojda
>>> Sent: 01 December 2015 19:34
>>> To: Wido den Hollander
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] OSD on a partition
>>>
>>> Well so here is my problem.  I want to move journal to SSD, but I have no
>>> more Disk slots available, and the SSD I do have is used for my higher IO
>>> OSDs.  Hence I don’t want to lose my OSD.
>>>
>>> So my thought was to partition the SSD into 10G and the rest with the
>>>
>>> “rest”
>>>
>>> being used for OSD, while the 10G be used for Journal.  However I can’t
>>>
>>> find
>>>
>>> a reliable way to put OSD on a partition which is why I am wondering if
>>>
>>> there
>>>
>>> is a way to do it.
>>>
>>>
>>> I'm wondering if you can stop the SSD OSD, unmount, shrink the partition,
>>> add the extra journal partition, start the OSD.
>>>
>>> Then stop the disk based OSD, flush the journal, move to new partition on
>>> SSD and then start it.
>>>
>>>
>>>
>>> Alernatively I could put the Journal on the SSD itself (it is ext4 file
>>>
>>> system) but
>>>
>>> not sure if that wouldn’t be bad from perspective of Ceph to do.
>>>
>>> Down the road I will have more SSD but this won’t happen until new budget
>>> hits and I can get more servers.
>>>
>>>
>>>
>>> On Dec 1, 2015, at 12:11 PM, Wido den Hollander
>>>
>>> wrote:
>>>
>>>
>>> On 12/01/2015 07:29 PM, Marek Dohojda wrote:
>>>
>>> I am looking through google, and I am not seeing a good guide as to
>>> how to put an OSD on a partition (GPT) of a disk.  I see lots of
>>> options for file sys

Re: [ceph-users] OSD on a partition

2015-12-01 Thread Marek Dohojda

Didn't mean to send the previous email; I apologize for the spam.

Anyway Thank you, I will give that a shot.  

Whenever I tried to use ceph-disk prepare on a partition, I got the OSD complaining
that this isn't a block device.  Will the sgdisk typecode fix that?




> On Dec 1, 2015, at 3:03 PM, Robert LeBlanc  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> The documentation is a little sparse in this regard, here is what I use:
> 
> sgdisk --new=1:0:+10240M --change-name=1:"ceph journal"
> --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
> sgdisk --new=2:0:0 --change-name=2:"ceph data"
> --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdc
> 
> This creates a partition that is 10 GB from the front of the drive as
> a journal, then uses the rest of the drive for an OSD. You can then
> use ceph-disk prepare /dev/sdc2 [/dev/sdc1].
> 
> Adapt it to fit your needs.
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.3.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWXhkfCRDmVDuy+mK58QAAmEAP/2RtZfhDrnsS0ME3mya/
> uCYhiXbEdTk+mhmJokC00bHwiZczIf11OUcHIZahvQ/Z/a5ojLIIJWnaSNbc
> TMLLiHvfbsm8Rs8cMKzpX0NuGgJNefS3M7XpnacgC6ZmE31Rtnd1bi7ThqnK
> CYzgyS/m5kjmRVwJNr76PpGx6tPiFC3oZgDesq0bm0T97RDjfyYXB3wxkVWY
> V3u51m7CLa9y3rvNbdGmwiWoR6jhmFMic5tCLJYD6zKnvhhq6P6OLM4RA+6P
> quSaFUAmZ/JrMWPEY3/B+lRx3j4kdXue2OJIgRQf7XiSJpeubFgVGtxSzBYz
> OYsV09fOpS7dLojXtmsrQekSIQIGqy3PZMl/WfQVNdQ+etVOenR+8CBhTTst
> or8fu+s8n+T9brcvFP2cfwickF5Rp+tVc3d238l+Kbc4t6SLtx71q5/AiQpR
> 8mEOvRHlTTxoaozleepuw7xnymnNShFogwzCXYj7DoaBMxTT4igWfHwWb5/I
> 0R5bYkheBkYxLlVaf7faUWcjySwunW1SY/rc2FkUFe52VlZ5cbFfJ+ym0an5
> i5SdfLd0gk4zR5l35j7svdJZU9+QIZLcz/S12Nx5mwUxhnhEeqYMBS/ENSca
> tKq4nlqyIGaCyDaLlcaECRLBjskrNRMeV7vnNUQ59BzJuMWOHhq571zHeXYO
> tezS
> =mxz9
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Dec 1, 2015 at 2:46 PM, Marek Dohojda
>  wrote:
>> Thus far parted isn’t very nice with ext4, I suppose I could try to do so
>> and the worst that would happen is me loosing 1 OSD, however in my tests
>> this wasn’t very reliable.
>> 
>> Non GPT partition, utilizing fdisk I can do this without a problem, but OSD
>> requires GPT (to the best of my knowledge anyway).
>> 
>> Hence why I would like to know if there is a way for me to do a partition
>> from the get go.  Since if the shrink doesn’t work my only other option is
>> this.  Unless of course I create a directory on the OSD file system and
>> simlink the Spindle Journal within that new directory something like
>> 
>> ln -s /var/lib/ceph/osd-0/spin_journal/journal /var/lib/ceph/osd-2/journal
>> 
>> I feel that this approach is not very clean though.
>> 
>> 
>> 
>> On Dec 1, 2015, at 12:39 PM, Nick Fisk  wrote:
>> 
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Marek Dohojda
>> Sent: 01 December 2015 19:34
>> To: Wido den Hollander 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] OSD on a partition
>> 
>> Well so here is my problem.  I want to move journal to SSD, but I have no
>> more Disk slots available, and the SSD I do have is used for my higher IO
>> OSDs.  Hence I don’t want to lose my OSD.
>> 
>> So my thought was to partition the SSD into 10G and the rest with the
>> 
>> “rest”
>> 
>> being used for OSD, while the 10G be used for Journal.  However I can’t
>> 
>> find
>> 
>> a reliable way to put OSD on a partition which is why I am wondering if
>> 
>> there
>> 
>> is a way to do it.
>> 
>> 
>> I'm wondering if you can stop the SSD OSD, unmount, shrink the partition,
>> add the extra journal partition, start the OSD.
>> 
>> Then stop the disk based OSD, flush the journal, move to new partition on
>> SSD and then start it.
>> 
>> 
>> 
>> Alernatively I could put the Journal on the SSD itself (it is ext4 file
>> 
>> system) but
>> 
>> not sure if that wouldn’t be bad from perspective of Ceph to do.
>> 
>> Down the road I will have more SSD but this won’t happen until new budget
>> hits and I can get more servers.
>> 
>> 
>> 
>> On Dec 1, 2015, at 12:11 PM, Wido den Hollander 
>> 
>> wrote:
>> 
>> 
>> On 12/01/2015 07:29 PM, Marek Dohojda wrote:
>> 
>> I am looking through google, and I am not seeing a good guide as to
>> how to put an OSD on a partition (GPT) of a disk.  I see lots of
>> options for file system, or single physical drive but not partition.
>> 
>> http://dachary.org/?p=2548
>> 
>> This is only thing I found but that is from 2 years ago and no
>> comments if this works or not.
>> 
>> Is there a better guide/best practice for such a scenario?
>> 
>> 
>> Well, what is the thing you are trying to achieve? All tools want full
>> disks, but an OSD doesn't want it persé. It just wants a mount point
>> where it can write data to.
>> 
>> You can always manually bootstrap a cluster if you want to.
>> 
>> 
>> 
>> 
>> 

Re: [ceph-users] OSD on a partition

2015-12-01 Thread Marek Dohojda
Thank you!

I w
 

> On Dec 1, 2015, at 3:03 PM, Robert LeBlanc  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> The documentation is a little sparse in this regard, here is what I use:
> 
> sgdisk --new=1:0:+10240M --change-name=1:"ceph journal"
> --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
> sgdisk --new=2:0:0 --change-name=2:"ceph data"
> --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdc
> 
> This creates a partition that is 10 GB from the front of the drive as
> a journal, then uses the rest of the drive for an OSD. You can then
> use ceph-disk prepare /dev/sdc2 [/dev/sdc1].
> 
> Adapt it to fit your needs.
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.3.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWXhkfCRDmVDuy+mK58QAAmEAP/2RtZfhDrnsS0ME3mya/
> uCYhiXbEdTk+mhmJokC00bHwiZczIf11OUcHIZahvQ/Z/a5ojLIIJWnaSNbc
> TMLLiHvfbsm8Rs8cMKzpX0NuGgJNefS3M7XpnacgC6ZmE31Rtnd1bi7ThqnK
> CYzgyS/m5kjmRVwJNr76PpGx6tPiFC3oZgDesq0bm0T97RDjfyYXB3wxkVWY
> V3u51m7CLa9y3rvNbdGmwiWoR6jhmFMic5tCLJYD6zKnvhhq6P6OLM4RA+6P
> quSaFUAmZ/JrMWPEY3/B+lRx3j4kdXue2OJIgRQf7XiSJpeubFgVGtxSzBYz
> OYsV09fOpS7dLojXtmsrQekSIQIGqy3PZMl/WfQVNdQ+etVOenR+8CBhTTst
> or8fu+s8n+T9brcvFP2cfwickF5Rp+tVc3d238l+Kbc4t6SLtx71q5/AiQpR
> 8mEOvRHlTTxoaozleepuw7xnymnNShFogwzCXYj7DoaBMxTT4igWfHwWb5/I
> 0R5bYkheBkYxLlVaf7faUWcjySwunW1SY/rc2FkUFe52VlZ5cbFfJ+ym0an5
> i5SdfLd0gk4zR5l35j7svdJZU9+QIZLcz/S12Nx5mwUxhnhEeqYMBS/ENSca
> tKq4nlqyIGaCyDaLlcaECRLBjskrNRMeV7vnNUQ59BzJuMWOHhq571zHeXYO
> tezS
> =mxz9
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Dec 1, 2015 at 2:46 PM, Marek Dohojda
>  wrote:
>> Thus far parted isn’t very nice with ext4, I suppose I could try to do so
>> and the worst that would happen is me loosing 1 OSD, however in my tests
>> this wasn’t very reliable.
>> 
>> Non GPT partition, utilizing fdisk I can do this without a problem, but OSD
>> requires GPT (to the best of my knowledge anyway).
>> 
>> Hence why I would like to know if there is a way for me to do a partition
>> from the get go.  Since if the shrink doesn’t work my only other option is
>> this.  Unless of course I create a directory on the OSD file system and
>> simlink the Spindle Journal within that new directory something like
>> 
>> ln -s /var/lib/ceph/osd-0/spin_journal/journal /var/lib/ceph/osd-2/journal
>> 
>> I feel that this approach is not very clean though.
>> 
>> 
>> 
>> On Dec 1, 2015, at 12:39 PM, Nick Fisk  wrote:
>> 
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Marek Dohojda
>> Sent: 01 December 2015 19:34
>> To: Wido den Hollander 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] OSD on a partition
>> 
>> Well so here is my problem.  I want to move journal to SSD, but I have no
>> more Disk slots available, and the SSD I do have is used for my higher IO
>> OSDs.  Hence I don’t want to lose my OSD.
>> 
>> So my thought was to partition the SSD into 10G and the rest with the
>> 
>> “rest”
>> 
>> being used for OSD, while the 10G be used for Journal.  However I can’t
>> 
>> find
>> 
>> a reliable way to put OSD on a partition which is why I am wondering if
>> 
>> there
>> 
>> is a way to do it.
>> 
>> 
>> I'm wondering if you can stop the SSD OSD, unmount, shrink the partition,
>> add the extra journal partition, start the OSD.
>> 
>> Then stop the disk based OSD, flush the journal, move to new partition on
>> SSD and then start it.
>> 
>> 
>> 
>> Alernatively I could put the Journal on the SSD itself (it is ext4 file
>> 
>> system) but
>> 
>> not sure if that wouldn’t be bad from perspective of Ceph to do.
>> 
>> Down the road I will have more SSD but this won’t happen until new budget
>> hits and I can get more servers.
>> 
>> 
>> 
>> On Dec 1, 2015, at 12:11 PM, Wido den Hollander 
>> 
>> wrote:
>> 
>> 
>> On 12/01/2015 07:29 PM, Marek Dohojda wrote:
>> 
>> I am looking through google, and I am not seeing a good guide as to
>> how to put an OSD on a partition (GPT) of a disk.  I see lots of
>> options for file system, or single physical drive but not partition.
>> 
>> http://dachary.org/?p=2548
>> 
>> This is only thing I found but that is from 2 years ago and no
>> comments if this works or not.
>> 
>> Is there a better guide/best practice for such a scenario?
>> 
>> 
>> Well, what is the thing you are trying to achieve? All tools want full
>> disks, but an OSD doesn't want it persé. It just wants a mount point
>> where it can write data to.
>> 
>> You can always manually bootstrap a cluster if you want to.
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> 
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>> 
>> Phone: +31 (0)20 700 9902
>> Skype:

Re: [ceph-users] OSD on a partition

2015-12-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

The documentation is a little sparse in this regard, here is what I use:

sgdisk --new=1:0:+10240M --change-name=1:"ceph journal"
--typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
sgdisk --new=2:0:0 --change-name=2:"ceph data"
--typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdc

This creates a partition that is 10 GB from the front of the drive as
a journal, then uses the rest of the drive for an OSD. You can then
use ceph-disk prepare /dev/sdc2 [/dev/sdc1].

Adapt it to fit your needs.
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWXhkfCRDmVDuy+mK58QAAmEAP/2RtZfhDrnsS0ME3mya/
uCYhiXbEdTk+mhmJokC00bHwiZczIf11OUcHIZahvQ/Z/a5ojLIIJWnaSNbc
TMLLiHvfbsm8Rs8cMKzpX0NuGgJNefS3M7XpnacgC6ZmE31Rtnd1bi7ThqnK
CYzgyS/m5kjmRVwJNr76PpGx6tPiFC3oZgDesq0bm0T97RDjfyYXB3wxkVWY
V3u51m7CLa9y3rvNbdGmwiWoR6jhmFMic5tCLJYD6zKnvhhq6P6OLM4RA+6P
quSaFUAmZ/JrMWPEY3/B+lRx3j4kdXue2OJIgRQf7XiSJpeubFgVGtxSzBYz
OYsV09fOpS7dLojXtmsrQekSIQIGqy3PZMl/WfQVNdQ+etVOenR+8CBhTTst
or8fu+s8n+T9brcvFP2cfwickF5Rp+tVc3d238l+Kbc4t6SLtx71q5/AiQpR
8mEOvRHlTTxoaozleepuw7xnymnNShFogwzCXYj7DoaBMxTT4igWfHwWb5/I
0R5bYkheBkYxLlVaf7faUWcjySwunW1SY/rc2FkUFe52VlZ5cbFfJ+ym0an5
i5SdfLd0gk4zR5l35j7svdJZU9+QIZLcz/S12Nx5mwUxhnhEeqYMBS/ENSca
tKq4nlqyIGaCyDaLlcaECRLBjskrNRMeV7vnNUQ59BzJuMWOHhq571zHeXYO
tezS
=mxz9
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Dec 1, 2015 at 2:46 PM, Marek Dohojda
 wrote:
> Thus far parted isn’t very nice with ext4, I suppose I could try to do so
> and the worst that would happen is me loosing 1 OSD, however in my tests
> this wasn’t very reliable.
>
> Non GPT partition, utilizing fdisk I can do this without a problem, but OSD
> requires GPT (to the best of my knowledge anyway).
>
> Hence why I would like to know if there is a way for me to do a partition
> from the get go.  Since if the shrink doesn’t work my only other option is
> this.  Unless of course I create a directory on the OSD file system and
> simlink the Spindle Journal within that new directory something like
>
> ln -s /var/lib/ceph/osd-0/spin_journal/journal /var/lib/ceph/osd-2/journal
>
> I feel that this approach is not very clean though.
>
>
>
> On Dec 1, 2015, at 12:39 PM, Nick Fisk  wrote:
>
>
>
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Marek Dohojda
> Sent: 01 December 2015 19:34
> To: Wido den Hollander 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD on a partition
>
> Well so here is my problem.  I want to move journal to SSD, but I have no
> more Disk slots available, and the SSD I do have is used for my higher IO
> OSDs.  Hence I don’t want to lose my OSD.
>
> So my thought was to partition the SSD into 10G and the rest with the
>
> “rest”
>
> being used for OSD, while the 10G be used for Journal.  However I can’t
>
> find
>
> a reliable way to put OSD on a partition which is why I am wondering if
>
> there
>
> is a way to do it.
>
>
> I'm wondering if you can stop the SSD OSD, unmount, shrink the partition,
> add the extra journal partition, start the OSD.
>
> Then stop the disk based OSD, flush the journal, move to new partition on
> SSD and then start it.
>
>
>
> Alernatively I could put the Journal on the SSD itself (it is ext4 file
>
> system) but
>
> not sure if that wouldn’t be bad from perspective of Ceph to do.
>
> Down the road I will have more SSD but this won’t happen until new budget
> hits and I can get more servers.
>
>
>
> On Dec 1, 2015, at 12:11 PM, Wido den Hollander 
>
> wrote:
>
>
> On 12/01/2015 07:29 PM, Marek Dohojda wrote:
>
> I am looking through google, and I am not seeing a good guide as to
> how to put an OSD on a partition (GPT) of a disk.  I see lots of
> options for file system, or single physical drive but not partition.
>
> http://dachary.org/?p=2548
>
> This is only thing I found but that is from 2 years ago and no
> comments if this works or not.
>
> Is there a better guide/best practice for such a scenario?
>
>
> Well, what is the thing you are trying to achieve? All tools want full
> disks, but an OSD doesn't want it persé. It just wants a mount point
> where it can write data to.
>
> You can always manually bootstrap a cluster if you want to.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/

Re: [ceph-users] OSD on a partition

2015-12-01 Thread Marek Dohojda
Thus far parted isn't very nice with ext4. I suppose I could try to do so, and
the worst that would happen is me losing 1 OSD; however, in my tests this
wasn't very reliable.

Non GPT partition, utilizing fdisk I can do this without a problem, but OSD 
requires GPT (to the best of my knowledge anyway).

Hence why I would like to know if there is a way for me to do a partition from
the get-go, since if the shrink doesn't work my only other option is this.
Unless of course I create a directory on the OSD file system and symlink the
spindle journal within that new directory, something like

ln -s /var/lib/ceph/osd-0/spin_journal/journal /var/lib/ceph/osd-2/journal

I feel that this approach is not very clean though.



> On Dec 1, 2015, at 12:39 PM, Nick Fisk  wrote:
> 
> 
> 
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
>> ] On Behalf Of
>> Marek Dohojda
>> Sent: 01 December 2015 19:34
>> To: Wido den Hollander mailto:w...@42on.com>>
>> Cc: ceph-users@lists.ceph.com 
>> Subject: Re: [ceph-users] OSD on a partition
>> 
>> Well so here is my problem.  I want to move journal to SSD, but I have no
>> more Disk slots available, and the SSD I do have is used for my higher IO
>> OSDs.  Hence I don’t want to lose my OSD.
>> 
>> So my thought was to partition the SSD into 10G and the rest with the
> “rest”
>> being used for OSD, while the 10G be used for Journal.  However I can’t
> find
>> a reliable way to put OSD on a partition which is why I am wondering if
> there
>> is a way to do it.
> 
> I'm wondering if you can stop the SSD OSD, unmount, shrink the partition,
> add the extra journal partition, start the OSD.
> 
> Then stop the disk based OSD, flush the journal, move to new partition on
> SSD and then start it.
> 
>> 
>> Alernatively I could put the Journal on the SSD itself (it is ext4 file
> system) but
>> not sure if that wouldn’t be bad from perspective of Ceph to do.
>> 
>> Down the road I will have more SSD but this won’t happen until new budget
>> hits and I can get more servers.
>> 
>> 
>> 
>>> On Dec 1, 2015, at 12:11 PM, Wido den Hollander 
>> wrote:
>>> 
>>> On 12/01/2015 07:29 PM, Marek Dohojda wrote:
 I am looking through google, and I am not seeing a good guide as to
 how to put an OSD on a partition (GPT) of a disk.  I see lots of
 options for file system, or single physical drive but not partition.
 
 http://dachary.org/?p=2548
 
 This is only thing I found but that is from 2 years ago and no
 comments if this works or not.
 
 Is there a better guide/best practice for such a scenario?
 
>>> 
>>> Well, what is the thing you are trying to achieve? All tools want full
>>> disks, but an OSD doesn't want it persé. It just wants a mount point
>>> where it can write data to.
>>> 
>>> You can always manually bootstrap a cluster if you want to.
>>> 
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
>>> 
>>> 
>>> --
>>> Wido den Hollander
>>> 42on B.V.
>>> Ceph trainer and consultant
>>> 
>>> Phone: +31 (0)20 700 9902
>>> Skype: contact42on
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] does anyone know what xfsaild and kworker are?they make osd disk busy. produce 100-200iops per osd disk?

2015-12-01 Thread Somnath Roy
Sure. The following settings helped me minimize the effect a bit for the PR
https://github.com/ceph/ceph/pull/6670:


  sysctl -w fs.xfs.xfssyncd_centisecs=72
  sysctl -w fs.xfs.xfsbufd_centisecs=3000
  sysctl -w fs.xfs.age_buffer_centisecs=72

But for the existing Ceph write path you may need to tweak these.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of flisky
Sent: Tuesday, December 01, 2015 11:04 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] does anyone know what xfsaild and kworker are?they 
make osd disk busy. produce 100-200iops per osd disk?

On 2015-12-02 01:31, Somnath Roy wrote:
> This is xfs metadata sync process...when it is waking up and there are lot of 
> data to sync it will throttle all the process accessing the drive...There are 
> some xfs settings to control the behavior, but you can't stop that
May I ask how to tune the xfs settings? Thanks!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cinder-CEPH Job Openings with @WalmartLabs [Location: India, Bangalore]

2015-12-01 Thread Janardhan Husthimme
Hello,

I am looking to hire the best Ceph minds to work on building scalable block and
object storage solutions (in progress). If interested, do drop me an email/CV.

Note: Job location is India, Bangalore.

Thanks,
Janardhan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High 0.94.5 OSD memory use at 8GB RAM/TB raw disk during recovery

2015-12-01 Thread Ryan Tokarek

> On Nov 30, 2015, at 6:52 PM, Laurent GUERBY  wrote:
> 
> Hi,
> 
> We lost a disk today in our ceph cluster so we added a new machine with
> 4 disks to replace the capacity and we activated straw1 tunable too
> (we also tried straw2 but we quickly backed out this change).
> 
> During recovery OSD started crashing on all of our machines
> the issue being OSD RAM usage that goes very high, eg:
> 
> 24078 root  20   0 27.784g 0.026t  10888 S   5.9 84.9
> 16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
> /dev/sda1   2.7T  2.2T  514G  82% /var/lib/ceph/osd/ceph-41
> 
> That's about 8GB resident RAM per TB of disk, way above
> what we provisioned ~ 2-4 GB RAM/TB.

We had something vaguely similar (not nearly that dramatic though!) happen to 
us. During a recovery (actually, I think this was rebalancing after upgrading 
from an earlier version of ceph), our OSDs took so much memory they would get 
killed by oom_killer and we couldn't keep the cluster up long enough to get 
back to healthy. 

A solution for us was to enable zswap; previously we had been running with no 
swap at all. 

If you are running a kernel newer than 3.11 (you might want more recent than 
that as I believe there were major fixes after 3.17), then enabling zswap 
allows the kernel to compress pages in memory before needing to touch disk. The 
default max pool size for this is 20% of memory. There is extra CPU time to 
compress/decompress, but it's much faster than going to disk, and the OSD data 
appears to be quite compressible. For us, nothing actually made it to the disk, 
but a swapfile must be enabled for zswap to do its work.

https://www.kernel.org/doc/Documentation/vm/zswap.txt
http://askubuntu.com/questions/471912/zram-vs-zswap-vs-zcache-ultimate-guide-when-to-use-which-one

Add "zswap.enabled=1" to your kernel bool parameters and reboot. 

If you have no swap file/partition/disk/whatever, then you need one for zswap 
to actually do anything. Here is an example, but use whatever sizes, locations, 
process you prefer:

dd if=/dev/zero of=/var/swap bs=1M count=8192
chmod 600 /var/swap
mkswap /var/swap
swapon /var/swap

Consider adding it to /etc/fstab:
/var/swap   swapswapdefaults 0 0 

This got us through the rebalancing. The OSDs eventually returned to normal, 
but we've just left zswap enabled with no apparent problems. I don't know that 
it will be enough for your situation, but it might help. 
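
If it helps, after rebooting you can confirm zswap is actually on by reading
its module parameters (path per the zswap doc above):

grep -H . /sys/module/zswap/parameters/*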

Ryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on a partition

2015-12-01 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Marek Dohojda
> Sent: 01 December 2015 19:34
> To: Wido den Hollander 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD on a partition
> 
> Well so here is my problem.  I want to move journal to SSD, but I have no
> more Disk slots available, and the SSD I do have is used for my higher IO
> OSDs.  Hence I don’t want to lose my OSD.
> 
> So my thought was to partition the SSD into 10G and the rest with the
“rest”
> being used for OSD, while the 10G be used for Journal.  However I can’t
find
> a reliable way to put OSD on a partition which is why I am wondering if
there
> is a way to do it.

I'm wondering if you can stop the SSD OSD, unmount, shrink the partition,
add the extra journal partition, start the OSD.

Then stop the disk based OSD, flush the journal, move to new partition on
SSD and then start it.
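
Roughly, the journal move for the spinning-disk OSD would be something like
this (completely untested sketch; osd.2, the partition path and the upstart
job names are illustrative):

stop ceph-osd id=2
ceph-osd -i 2 --flush-journal
ln -sf /dev/disk/by-partuuid/<new-journal-partuuid> /var/lib/ceph/osd/ceph-2/journal
ceph-osd -i 2 --mkjournal
start ceph-osd id=2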

> 
> Alternatively I could put the Journal on the SSD itself (it is ext4 file
system) but
> not sure if that wouldn’t be bad from perspective of Ceph to do.
> 
> Down the road I will have more SSD but this won’t happen until new budget
> hits and I can get more servers.
> 
> 
> 
> > On Dec 1, 2015, at 12:11 PM, Wido den Hollander 
> wrote:
> >
> > On 12/01/2015 07:29 PM, Marek Dohojda wrote:
> >> I am looking through google, and I am not seeing a good guide as to
> >> how to put an OSD on a partition (GPT) of a disk.  I see lots of
> >> options for file system, or single physical drive but not partition.
> >>
> >> http://dachary.org/?p=2548
> >>
> >> This is only thing I found but that is from 2 years ago and no
> >> comments if this works or not.
> >>
> >> Is there a better guide/best practice for such a scenario?
> >>
> >
> > Well, what is the thing you are trying to achieve? All tools want full
> > disks, but an OSD doesn't want it persé. It just wants a mount point
> > where it can write data to.
> >
> > You can always manually bootstrap a cluster if you want to.
> >
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> >
> > --
> > Wido den Hollander
> > 42on B.V.
> > Ceph trainer and consultant
> >
> > Phone: +31 (0)20 700 9902
> > Skype: contact42on
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on a partition

2015-12-01 Thread Marek Dohojda
Well, so here is my problem. I want to move the journal to an SSD, but I have no more
disk slots available, and the SSD I do have is used for my higher-IO OSDs.
Hence I don't want to lose my OSD.

So my thought was to partition the SSD into a 10G partition plus the "rest", with the
"rest" being used for an OSD and the 10G used for the journal. However, I can't find a
reliable way to put an OSD on a partition, which is why I am wondering if there is
a way to do it.

Alternatively I could put the journal on the SSD itself (it is an ext4 file system),
but I'm not sure whether that would be bad to do from Ceph's perspective.

Down the road I will have more SSD but this won’t happen until new budget hits 
and I can get more servers.



> On Dec 1, 2015, at 12:11 PM, Wido den Hollander  wrote:
> 
> On 12/01/2015 07:29 PM, Marek Dohojda wrote:
>> I am looking through google, and I am not seeing a good guide as to how
>> to put an OSD on a partition (GPT) of a disk.  I see lots of options for
>> file system, or single physical drive but not partition.  
>> 
>> http://dachary.org/?p=2548
>> 
>> This is only thing I found but that is from 2 years ago and no comments
>> if this works or not.
>> 
>> Is there a better guide/best practice for such a scenario?
>> 
> 
> Well, what is the thing you are trying to achieve? All tools want full
> disks, but an OSD doesn't want it persé. It just wants a mount point
> where it can write data to.
> 
> You can always manually bootstrap a cluster if you want to.
> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on a partition

2015-12-01 Thread Wido den Hollander
On 12/01/2015 07:29 PM, Marek Dohojda wrote:
> I am looking through google, and I am not seeing a good guide as to how
> to put an OSD on a partition (GPT) of a disk.  I see lots of options for
> file system, or single physical drive but not partition.  
> 
> http://dachary.org/?p=2548
> 
> This is only thing I found but that is from 2 years ago and no comments
> if this works or not.
> 
> Is there a better guide/best practice for such a scenario?
> 

Well, what is the thing you are trying to achieve? All tools want full
disks, but an OSD doesn't want it per se. It just wants a mount point
where it can write data to.

You can always manually bootstrap a cluster if you want to.

> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] does anyone know what xfsaild and kworker are?they make osd disk busy. produce 100-200iops per osd disk?

2015-12-01 Thread flisky
On 2015-12-02 01:31, Somnath Roy wrote:
> This is xfs metadata sync process...when it is waking up and there are lot of 
> data to sync it will throttle all the process accessing the drive...There are 
> some xfs settings to control the behavior, but you can't stop that
May I ask how to tune the xfs settings? Thanks!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-01 Thread Tom Christensen
Another "new" thing we see with hammer is constant:

mon.0 [INF] from='client.52217412 :/0' entity='client.admin'
cmd='[{"prefix": "osd blacklist", "blacklistop": "add", "addr":
":0/3562049007"}]': finished

entries in the log and while watching ceph -w

The cluster appears to generate a new osdmap after every one of these
entries.  These appear to be associated with client connect/disconnect
operations.  Unfortunately in our use case we use librbd to connect and
disconnect a lot.  What does this new message indicate?  Can it be disabled
or turned off? so that librbd sessions don't cause a new osdmap to be
generated?

In ceph -w output, whenever we see those entries, we immediately see a new
osdmap, hence my suspicion that this message is causing a new osdmap to be
generated.
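
For what it's worth, the current entries can be inspected (and removed by hand
if needed) with:

ceph osd blacklist ls
ceph osd blacklist rm <addr>

but that doesn't answer whether the add itself can be avoided for short-lived
librbd sessions.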




On Tue, Dec 1, 2015 at 11:02 AM, Tom Christensen  wrote:

> Another thing that we don't quite grasp is that when we see slow requests
> now they almost always, probably 95% have the "known_if_redirected" state
> set.  What does this state mean?  Does it indicate we have OSD maps that
> are lagging and the cluster isn't really in sync?  Could this be the cause
> of our growing osdmaps?
>
> -Tom
>
>
> On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul) <
> paul.hewl...@alcatel-lucent.com> wrote:
>
>> I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can
>> anybody confirm this?
>> I could not find any usage in the Ceph source code except that the value
>> is set in some of the test software…
>>
>> Paul
>>
>>
>> From: ceph-users  on behalf of Tom
>> Christensen 
>> Date: Monday, 30 November 2015 at 23:20
>> To: "ceph-users@lists.ceph.com" 
>> Subject: Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs
>>
>> What counts as ancient?  Concurrent to our hammer upgrade we went from
>> 3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel
>> we'd been running because we're also seeing an intermittent (its happened
>> twice in 2 weeks) massive load spike that completely hangs the osd node
>> (we're talking about load averages that hit 20k+ before the box becomes
>> completely unresponsive).  We saw a similar behavior on a 3.13 kernel,
>> which resolved by moving to the 3.16 kernel we had before.  I'll try to
>> catch one with debug_ms=1 and see if I can see it we're hitting a similar
>> hang.
>>
>> To your comment about omap, we do have filestore xattr use omap = true in
>> our conf... which we believe was placed there by ceph-deploy (which we used
>> to deploy this cluster).  We are on xfs, but we do take tons of RBD
>> snapshots.  If either of these use cases will cause lots of osd map size
>> then, we may just be exceeding the limits of the number of rbd snapshots
>> ceph can handle (we take about 4-5000/day, 1 per RBD in the cluster)
>>
>> An interesting note, we had an OSD flap earlier this morning, and when it
>> did, immediately after it came back I checked its meta directory size with
>> du -sh, this returned immediately, and showed a size of 107GB.  The fact
>> that it returned immediately indicated to me that something had just
>> recently read through that whole directory and it was all cached in the FS
>> cache.  Normally a du -sh on the meta directory takes a good 5 minutes to
>> return.  Anyway, since it dropped this morning its meta directory size
>> continues to shrink and is down to 93GB.  So it feels like something
>> happens that makes the OSD read all its historical maps which results in
>> the OSD hanging cause there are a ton of them, and then it wakes up and
>> realizes it can delete a bunch of them...
>>
>> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster 
>> wrote:
>>
>>> The trick with debugging heartbeat problems is to grep back through the
>>> log to find the last thing the affected thread was doing, e.g. is
>>> 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the
>>> omap, etc..
>>>
>>> I agree this doesn't look to be network related, but if you want to rule
>>> it out you should use debug_ms=1.
>>>
>>> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
>>> similarly started getting slow requests. To make a long story short, our
>>> issue turned out to be sendmsg blocking (very rarely), probably due to an
>>> ancient el6 kernel (these osd servers had ~800 days' uptime). The signature
>>> of this was 900s of slow requests, then an ms log showing "initiating
>>> reconnect". Until we got the kernel upgraded everywhere, we used a
>>> workaround of ms tcp read timeout = 60.
>>> So, check your kernels, and upgrade if they're ancient. Latest el6
>>> kernels work for us.
>>>
>>> Otherwise, those huge osd leveldb's don't look right. (Unless you're
>>> using tons and tons of omap...) And it kinda reminds me of the other
>>> problem we hit after the hammer upgrade, namely the return of the ever
>>> growing mon leveldb issue. The solution was to recreate the mons one by
>>> one. Perhaps you've hit something similar with the OSDs. de

[ceph-users] Ceph job posting

2015-12-01 Thread Bill Sanders
Just dropping a note to say that Teradata (I work there!) is hiring to
build out a small-at-first Ceph team in our Rancho Bernardo office (near
San Diego, CA).

We're looking for engineers interested in getting Ceph to spin like a top
for our data warehouse applications.  You should know C/C++,
virtualization, and of course Ceph.  There are a lot of exciting projects at
Teradata right now, and an increased interest in Open Source.

We have a couple junior level and a couple senior level positions open
right now.  Take a peek if you're interested:

http://teradata.jobs/jobs/?location=San+Diego%2C+CA&q=ceph+software+defined

Or, if you'd like to know more send me an email. (Disclaimer, I work for
Teradata on this team, but I'm not the hiring manager)

Bill
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD on a partition

2015-12-01 Thread Marek Dohojda
I am looking through Google, and I am not seeing a good guide on how to put
an OSD on a partition (GPT) of a disk.  I see lots of options for a file system
or a single physical drive, but not a partition.

http://dachary.org/?p=2548 

This is the only thing I found, but it is from 2 years ago and there are no
comments on whether it works or not.

Is there a better guide/best practice for such a scenario?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-01 Thread Tom Christensen
Another thing that we don't quite grasp is that when we see slow requests
now they almost always, probably 95% have the "known_if_redirected" state
set.  What does this state mean?  Does it indicate we have OSD maps that
are lagging and the cluster isn't really in sync?  Could this be the cause
of our growing osdmaps?
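
One way to check how far back each OSD's stored maps go is the admin socket
status command (osd.41 is just an example; run it on the OSD's host):

ceph daemon osd.41 status

The oldest_map and newest_map fields there show the range of osdmap epochs the
OSD is still holding on disk.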

-Tom


On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul) <
paul.hewl...@alcatel-lucent.com> wrote:

> I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can
> anybody confirm this?
> I could not find any usage in the Ceph source code except that the value
> is set in some of the test software…
>
> Paul
>
>
> From: ceph-users  on behalf of Tom
> Christensen 
> Date: Monday, 30 November 2015 at 23:20
> To: "ceph-users@lists.ceph.com" 
> Subject: Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs
>
> What counts as ancient?  Concurrent to our hammer upgrade we went from
> 3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel
> we'd been running because we're also seeing an intermittent (it's happened
> twice in 2 weeks) massive load spike that completely hangs the osd node
> (we're talking about load averages that hit 20k+ before the box becomes
> completely unresponsive).  We saw a similar behavior on a 3.13 kernel,
> which resolved by moving to the 3.16 kernel we had before.  I'll try to
> catch one with debug_ms=1 and see if we're hitting a similar
> hang.
>
> To your comment about omap, we do have filestore xattr use omap = true in
> our conf... which we believe was placed there by ceph-deploy (which we used
> to deploy this cluster).  We are on xfs, but we do take tons of RBD
> snapshots.  If either of these use cases will cause lots of osd map size
> then, we may just be exceeding the limits of the number of rbd snapshots
> ceph can handle (we take about 4-5000/day, 1 per RBD in the cluster)
>
> An interesting note, we had an OSD flap earlier this morning, and when it
> did, immediately after it came back I checked its meta directory size with
> du -sh, this returned immediately, and showed a size of 107GB.  The fact
> that it returned immediately indicated to me that something had just
> recently read through that whole directory and it was all cached in the FS
> cache.  Normally a du -sh on the meta directory takes a good 5 minutes to
> return.  Anyway, since it dropped this morning its meta directory size
> continues to shrink and is down to 93GB.  So it feels like something
> happens that makes the OSD read all its historical maps which results in
> the OSD hanging cause there are a ton of them, and then it wakes up and
> realizes it can delete a bunch of them...
>
> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster 
> wrote:
>
>> The trick with debugging heartbeat problems is to grep back through the
>> log to find the last thing the affected thread was doing, e.g. is
>> 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the
>> omap, etc..
>>
>> I agree this doesn't look to be network related, but if you want to rule
>> it out you should use debug_ms=1.
>>
>> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
>> similarly started getting slow requests. To make a long story short, our
>> issue turned out to be sendmsg blocking (very rarely), probably due to an
>> ancient el6 kernel (these osd servers had ~800 days' uptime). The signature
>> of this was 900s of slow requests, then an ms log showing "initiating
>> reconnect". Until we got the kernel upgraded everywhere, we used a
>> workaround of ms tcp read timeout = 60.
>> So, check your kernels, and upgrade if they're ancient. Latest el6
>> kernels work for us.
>>
>> Otherwise, those huge osd leveldb's don't look right. (Unless you're
>> using tons and tons of omap...) And it kinda reminds me of the other
>> problem we hit after the hammer upgrade, namely the return of the ever
>> growing mon leveldb issue. The solution was to recreate the mons one by
>> one. Perhaps you've hit something similar with the OSDs. debug_osd=10 might
>> be good enough to see what the osd is doing, maybe you need
>> debug_filestore=10 also. If that doesn't show the problem, bump those up to
>> 20.
>>
>> Good luck,
>>
>> Dan
>>
>> On 30 Nov 2015 20:56, "Tom Christensen"  wrote:
>> >
>> > We recently upgraded to 0.94.3 from firefly and now for the last week
>> have had intermittent slow requests and flapping OSDs.  We have been unable
>> to nail down the cause, but its feeling like it may be related to our
>> osdmaps not getting deleted properly.  Most of our osds are now storing
>> over 100GB of data in the meta directory, almost all of that is historical
>> osd maps going back over 7 days old.
>> >
>> > We did do a small cluster change (We added 35 OSDs to a 1445 OSD
>> cluster), the rebalance took about 36 hours, and it completed 10 days ago.
>> Since that time the cluster has been HEALTH_OK and all pgs have been
>> active+clean except for when we have a

Re: [ceph-users] does anyone know what xfsaild and kworker are?they make osd disk busy. produce 100-200iops per osd disk?

2015-12-01 Thread Somnath Roy
This is the xfs metadata sync process... when it wakes up and there is a lot of
data to sync, it will throttle all the processes accessing the drive... There are
some xfs settings to control the behavior, but you can't stop it.

Sent from my iPhone

>> On Dec 1, 2015, at 8:26 AM, flisky  wrote:
>> 
>> On 2014-11-11 12:23, duan.xuf...@zte.com.cn wrote:
>> 
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> I'm facing exactly the same problem, and my situation is much worse.
> 
> BTT gives me this -
> 
>Q2Q   MIN   AVG   MAX   N
> --- - - - ---
> ceph-osd  0.01243   0.009448228   7.065125958   12643
> kworker   0.01491   0.479659256  30.080631593 226
> pid002853761  0.000668293  20.053390778  30.080227966   3
> xfsaild   0.01097   0.008947398  30.073285005   10879
> 
>D2C   MIN   AVG   MAX   N
> --- - - - ---
> ceph-osd  0.36810   0.014268501   1.626915131   12642
> kworker   0.44483   0.005548645   0.653310778 203
> pid002853761  0.000156094   0.001594357   0.005841911   4
> xfsaild   0.000307363   0.190863515   1.3219928029849
> 
> The disk util is almost 100%, while avgrq-sz and avgqu-sz is very low, which 
> makes me very confused.
> 
> Could any one give me some hint on this? Thanks!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tiering Investigation and Potential Patch

2015-12-01 Thread Nick Fisk




> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: 01 December 2015 16:58
> To: Nick Fisk ; 'Sage Weil' 
> Cc: 'ceph-users' ; ceph-de...@vger.kernel.org
> Subject: Re: Cache Tiering Investigation and Potential Patch
> 
> 
> 
> On 12/01/2015 10:30 AM, Nick Fisk wrote:
> > Hi Sage/Mark,
> >
> > I have completed some initial testing of the tiering fix PR you submitted
> compared to my method I demonstrated at the perf meeting last week.
> >
> >  From a high level both have very similar performance when compared to
> the current broken behaviour. So I think until Jewel, either way would suffice
> in fixing the bug.
> >
> > I have also been running several tests with different cache sizes and
> recency settings to try and determine if there is any performance
> differences.
> >
> > The main thing I have noticed is that when it is based on actual recency
> method in your PR, you run out of adjustment resolution down the low end
> of the recency scale. The difference between objects which are in 1,2 or 3
> concurrent hit sets is quite large and dramatically affects the promotion
> behaviour. After that though, there is not much difference between setting
> it to 3 or setting it to 9, a sort of logarithmic effect. This looks like it 
> might
> have an impact on being able to tune it to the right setting to be able to 
> fill
> the cache tier. After the cache had the really hot blocks in it, the 
> promotions
> tailed off and the tier wouldn't fill up as there just wasn't any more objects
> getting hit 3 or 4 times in a row. If I dropped the recency down by 1, then
> there were too many promotions.
> >
> > In short, if you set the recency anywhere between 3-4 and max(10) then
> you were pretty much guaranteed reasonable performance with a zipf1.1
> profile that I tested with.
> >
> > With my method, it seemed to have a more linear response and hence
> more adjustment resolution, but you needed to be a bit more clever about
> picking the right number. With a zipf1.1 profile and a cache size of around
> 15% of the volume, a recency setting between 6 and 8 (out of 10 hitsets)
> provided the best performance. Higher recency meant the cache couldn't
> find hot enough objects to promote, lower resulted in too many promotions.
> I think if you take the cache size percentage, then invert it and double it, 
> this
> should give you a rough idea of the required recency setting. Ie 20% cache
> size = 6 recency for 10 hitsets. 10% cache size would be 8 for 10 hitsets.
> 
> Very interesting Nick!  thanks for digging into all of this!  Forgive me 
> since it's
> been a little while since I've thought about this, but do you see either
> method as being more amenable to autotuning?  I think ultimately we need
> to be able to deal with rejecting promotions on an as-needed basis based on
> some kind of heuristics (size + completion time perhaps).

I think a combination of the 2 methods gets you as far as you can without 
developing some sort of queue/list based system. I don't know if you had a 
chance to read through the rest of the presentation I posted after the meeting, 
but 1 slide had a bit of a brain dump where blocks jumped up a queue the hotter 
they became. I think something like that would be one way of improving it as 
you are not limited by specifying hitsets/hit_counts/hits_recency...etc

In theory something like that should be more automated as it's not reliant on 
set values; rather, each object's hotness competes with other objects' hotness. 
Saying that, it was just something I thought about on the train into work and 
no doubt I have missed something.

Also when the promotion throttling code makes it in, that should help as well.

> 
> >
> > It could probably also do with some logic to promote really hot blocks
> faster. I'm guessing a combination of the two methods would probably be
> fairly simple to implement and provide the best gain.
> >
> > Promote IF
> > 1. Total number of hits in all hitsets > required count 2. Object is
> > in last N recent hitsets
> >
> > But as I touched on above, both of these methods are still vastly improved
> on the current code and it might be that it's not worth doing much more
> work on this, if a proper temperature based list method is likely to be
> implemented.
> >
> > I can try and get some graphs captured and jump on the perf meeting
> tomorrow if it would be useful?
> 
> That would be great if you have the time!  I may not be able to make it
> tomorrow, but I'll try to be there if I can.
> 
> >
> >
> > I also had a bit of a think about what you said regarding only keeping 1 
> > copy
> for non dirty objects and the potential write amplification involved. If we 
> had
> a similar logic to maybe_promote(), like maybe_dirty(), which would only
> dirty a block in the cache tier if it's very very hot, otherwise the write 
> gets
> proxied. That should limit the amount of objects requ

Re: [ceph-users] Cache Tiering Investigation and Potential Patch

2015-12-01 Thread Mark Nelson



On 12/01/2015 10:30 AM, Nick Fisk wrote:

Hi Sage/Mark,

I have completed some initial testing of the tiering fix PR you submitted 
compared to my method I demonstrated at the perf meeting last week.

 From a high level both have very similar performance when compared to the 
current broken behaviour. So I think until Jewel, either way would suffice in 
fixing the bug.

I have also been running several tests with different cache sizes and recency 
settings to try and determine if there is any performance differences.

The main thing I have noticed is that when it is based on actual recency method 
in your PR, you run out of adjustment resolution down the low end of the 
recency scale. The difference between objects which are in 1,2 or 3 concurrent 
hit sets is quite large and dramatically affects the promotion behaviour. After 
that though, there is not much difference between setting it to 3 or setting it 
to 9, a sort of logarithmic effect. This looks like it might have an impact on 
being able to tune it to the right setting to be able to fill the cache tier. 
After the cache had the really hot blocks in it, the promotions tailed off and 
the tier wouldn't fill up as there just wasn't any more objects getting hit 3 
or 4 times in a row. If I dropped the recency down by 1, then there were too 
many promotions.

In short, if you set the recency anywhere between 3-4 and max(10) then you were 
pretty much guaranteed reasonable performance with a zipf1.1 profile that I 
tested with.

With my method, it seemed to have a more linear response and hence more 
adjustment resolution, but you needed to be a bit more clever about picking the 
right number. With a zipf1.1 profile and a cache size of around 15% of the 
volume, a recency setting between 6 and 8 (out of 10 hitsets) provided the best 
performance. Higher recency meant the cache couldn't find hot enough objects to 
promote, lower resulted in too many promotions. I think if you take the cache 
size percentage, then invert it and double it, this should give you a rough 
idea of the required recency setting. Ie 20% cache size = 6 recency for 10 
hitsets. 10% cache size would be 8 for 10 hitsets.


Very interesting Nick!  thanks for digging into all of this!  Forgive me 
since it's been a little while since I've thought about this, but do you 
see either method as being more amenable to autotuning?  I think 
ultimately we need to be able to deal with rejecting promotions on an 
as-needed basis based on some kind of heuristics (size + completion time 
perhaps).




It could probably also do with some logic to promote really hot blocks faster. 
I'm guessing a combination of the two methods would probably be fairly simple 
to implement and provide the best gain.

Promote IF
1. Total number of hits in all hitsets > required count
2. Object is in last N recent hitsets

But as I touched on above, both of these methods are still vastly improved on 
the current code and it might be that it's not worth doing much more work on 
this, if a proper temperature based list method is likely to be implemented.

I can try and get some graphs captured and jump on the perf meeting tomorrow if 
it would be useful?


That would be great if you have the time!  I may not be able to make it 
tomorrow, but I'll try to be there if I can.





I also had a bit of a think about what you said regarding only keeping 1 copy 
for non dirty objects and the potential write amplification involved. If we had 
a similar logic to maybe_promote(), like maybe_dirty(), which would only dirty 
a block in the cache tier if it's very very hot, otherwise the write gets 
proxied. That should limit the amount of objects requiring extra copies to be 
generated every time there is a write. The end user may also want to turn off 
write caching altogether so that all writes are proxied to take advantage of 
larger read cache.

Nick


-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: 25 November 2015 20:41
To: Nick Fisk 
Cc: 'ceph-users' ; ceph-de...@vger.kernel.org;
'Mark Nelson' 
Subject: RE: Cache Tiering Investigation and Potential Patch

On Wed, 25 Nov 2015, Nick Fisk wrote:

Yes I think that should definitely be an improvement. I can't
quite get my head around how it will perform in instances where
you miss 1 hitset but all others are a hit. Like this:

H H H M H H H H H H H H

And recency is set to 8 for example. It maybe that it doesn't have
much effect on the overall performance. It might be that there is
a strong separation of really hot blocks and hot blocks, but this
could turn out to be a good thing.


Yeah... In the above case recency 3 would be enough (or 9, depending
on whether that's chronological or reverse chronological order).
Doing an N out of M or similar is a bit more flexible and probably
something we should add on top.  (Or, we could change recency to be
N/M instead of just
N.)


N out of M, is th

Re: [ceph-users] Cache Tiering Investigation and Potential Patch

2015-12-01 Thread Nick Fisk
Hi Sage/Mark,

I have completed some initial testing of the tiering fix PR you submitted 
compared to my method I demonstrated at the perf meeting last week.

From a high level both have very similar performance when compared to the 
current broken behaviour. So I think until Jewel, either way would suffice in 
fixing the bug.

I have also been running several tests with different cache sizes and recency 
settings to try and determine if there are any performance differences.

The main thing I have noticed is that when it is based on the actual recency method 
in your PR, you run out of adjustment resolution down the low end of the 
recency scale. The difference between objects which are in 1, 2 or 3 concurrent 
hit sets is quite large and dramatically affects the promotion behaviour. After 
that though, there is not much difference between setting it to 3 or setting it 
to 9, a sort of logarithmic effect. This looks like it might have an impact on 
being able to tune it to the right setting to be able to fill the cache tier. 
After the cache had the really hot blocks in it, the promotions tailed off and 
the tier wouldn't fill up as there just weren't any more objects getting hit 3 
or 4 times in a row. If I dropped the recency down by 1, then there were too 
many promotions.

In short, if you set the recency anywhere between 3-4 and max(10) then you were 
pretty much guaranteed reasonable performance with a zipf1.1 profile that I 
tested with.

With my method, it seemed to have a more linear response and hence more 
adjustment resolution, but you needed to be a bit more clever about picking the 
right number. With a zipf1.1 profile and a cache size of around 15% of the 
volume, a recency setting between 6 and 8 (out of 10 hitsets) provided the best 
performance. Higher recency meant the cache couldn't find hot enough objects to 
promote, lower resulted in too many promotions. I think if you take the cache 
size percentage, then invert it and double it, this should give you a rough 
idea of the required recency setting. Ie 20% cache size = 6 recency for 10 
hitsets. 10% cache size would be 8 for 10 hitsets.
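
A minimal sketch of that rule of thumb (assuming Python and the 10-hitset case 
quoted above; the helper name is made up for illustration):

# recency ~= hit_sets * (1 - 2 * cache_fraction)
def suggested_recency(cache_fraction, hit_sets=10):
    return int(round(hit_sets * (1.0 - 2.0 * cache_fraction)))

print(suggested_recency(0.20))  # -> 6, i.e. "20% cache size = 6 recency"
print(suggested_recency(0.10))  # -> 8, i.e. "10% cache size would be 8"

The result would then be applied with something like "ceph osd pool set <cachepool> 
min_read_recency_for_promote <N>".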

It could probably also do with some logic to promote really hot blocks faster. 
I'm guessing a combination of the two methods would probably be fairly simple 
to implement and provide the best gain.

Promote IF
1. Total number of hits in all hitsets > required count
2. Object is in last N recent hitsets
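
A minimal sketch of that combined check, assuming each object's history is available 
as a list of per-hitset hit counts, most recent first (that representation is an 
assumption, not something from the thread):

def should_promote(hit_counts, required_count, recency_n):
    total_hits = sum(hit_counts)
    seen_in_recent = all(c > 0 for c in hit_counts[:recency_n])  # in last N hitsets
    return total_hits > required_count and seen_in_recent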

But as I touched on above, both of these methods are still vastly improved on 
the current code and it might be that it's not worth doing much more work on 
this, if a proper temperature based list method is likely to be implemented.

I can try and get some graphs captured and jump on the perf meeting tomorrow if 
it would be useful?


I also had a bit of a think about what you said regarding only keeping 1 copy 
for non-dirty objects and the potential write amplification involved. If we had 
similar logic to maybe_promote(), say a maybe_dirty(), it would only dirty a 
block in the cache tier if it's very, very hot; otherwise the write gets 
proxied. That should limit the number of objects requiring extra copies to be 
generated every time there is a write. The end user may also want to turn off 
write caching altogether so that all writes are proxied, to take advantage of a 
larger read cache. 

Nick

> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: 25 November 2015 20:41
> To: Nick Fisk 
> Cc: 'ceph-users' ; ceph-de...@vger.kernel.org;
> 'Mark Nelson' 
> Subject: RE: Cache Tiering Investigation and Potential Patch
> 
> On Wed, 25 Nov 2015, Nick Fisk wrote:
> > > > Yes I think that should definitely be an improvement. I can't
> > > > quite get my head around how it will perform in instances where
> > > > you miss 1 hitset but all others are a hit. Like this:
> > > >
> > > > H H H M H H H H H H H H
> > > >
> > > > And recency is set to 8 for example. It maybe that it doesn't have
> > > > much effect on the overall performance. It might be that there is
> > > > a strong separation of really hot blocks and hot blocks, but this
> > > > could turn out to be a good thing.
> > >
> > > Yeah... In the above case recency 3 would be enough (or 9, depending
> > > on whether that's chronological or reverse chronological order).
> > > Doing an N out of M or similar is a bit more flexible and probably
> > > something we should add on top.  (Or, we could change recency to be
> > > N/M instead of just
> > > N.)
> >
> > N out of M, is that similar to what I came up with but combined with
> > the N most recent sets?
> 
> Yeah
> 
> > If you can wait a couple of days I will run the PR in its current
> > state through my test box and see how it looks.
> 
> Sounds great, thanks.
> 
> > Just a quick question, is there a way to just make+build the changed
> > files/package or select just to build the main ceph.deb. I'm just
> > using

Re: [ceph-users] does anyone know what xfsaild and kworker are?they make osd disk busy. produce 100-200iops per osd disk?

2015-12-01 Thread flisky
On 11 November 2014 at 12:23, duan.xuf...@zte.com.cn wrote:






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


I'm facing exactly the same problem, and my situation is much worse.

BTT gives me this -

Q2Q   MIN   AVG   MAX   N
--- - - - ---
ceph-osd  0.01243   0.009448228   7.065125958   12643
kworker   0.01491   0.479659256  30.080631593 226
pid002853761  0.000668293  20.053390778  30.080227966   3
xfsaild   0.01097   0.008947398  30.073285005   10879

D2C   MIN   AVG   MAX   N
--- - - - ---
ceph-osd  0.36810   0.014268501   1.626915131   12642
kworker   0.44483   0.005548645   0.653310778 203
pid002853761  0.000156094   0.001594357   0.005841911   4
xfsaild   0.000307363   0.190863515   1.3219928029849

The disk util is almost 100%, while avgrq-sz and avgqu-sz are very low, 
which makes me very confused.


Could any one give me some hint on this? Thanks!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Would HEALTH_DISASTER be a good addition?

2015-12-01 Thread Wido den Hollander


On 26-11-15 07:58, Wido den Hollander wrote:
> On 11/25/2015 10:46 PM, Gregory Farnum wrote:
>> On Wed, Nov 25, 2015 at 11:09 AM, Wido den Hollander  wrote:
>>> Hi,
>>>
>>> Currently we have OK, WARN and ERR as states for a Ceph cluster.
>>>
>>> Now, it could happen that while a Ceph cluster is in WARN state certain
>>> PGs are not available due to being in peering or any non-active+? state.
>>>
>>> When monitoring a Ceph cluster you usually want to see OK and not worry
>>> when a cluster is in WARN.
>>>
>>> However, with the current situation you need to check if there are any
>>> PGs in a non-active state since that means they are currently not doing
>>> any I/O.
>>>
>>> For example, size is to 3, min_size is set to 2. One OSD fails, cluster
>>> starts to recover/backfill. A second OSD fails which causes certain PGs
>>> to become undersized and no longer serve I/O.
>>>
>>> I've seen such situations happen multiple times. VMs running and a few
>>> PGs become non-active which caused about all I/O to stop effectively.
>>>
>>> The health stays in WARN, but a certain part of it is not serving I/O.
>>>
>>> My suggestion would be:
>>>
>>> OK: All PGs are active+clean and no other issues
>>> WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc)
>>> ERR: One or more PGs are not active
>>> DISASTER: Anything which currently triggers ERR
>>>
>>> This way you can monitor for ERR. If the cluster goes into >= ERR you
>>> know you have to come into action. <= WARN is just a thing you might
>>> want to look in to, but not at 03:00 on Sunday morning.
>>>
>>> Does this sound reasonable?
>>
>> It sounds like basically you want a way of distinguishing between
>> manual intervention required, and bad states which are going to be
>> repaired on their own. That sounds like a good idea to me, but I'm not
>> sure how feasible the specific thing here is. How long does a PG need
>> to be in a not-active state before you shift into the alert mode? They
>> can go through peering for a second or so when a node dies, and that
>> will block IO but probably shouldn't trigger alerts.
> 
> Hmm, let's say:
> 
> mon_pg_inactive_timeout = 30
> 
> If one or more PGs is inactive longer than 30 seconds we go in to error
> state. This gives us time to go through peering where needed.
> 
> If that isn't resolved within 30 seconds we switch to HEALTH_ERR. Admins
> can monitor for HEALTH_ERR and send out an alert when that happens.
> 
> This way you can ignore HEALTH_WARN since you know all I/O is continuing.
> 

I created an issue for this: http://tracker.ceph.com/issues/13923

Wido
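
Until something like that exists in the mon, the check can be approximated from 
outside; a minimal sketch, assuming "ceph pg dump_stuck inactive <seconds> --format 
json" returns a JSON list of stuck PGs (the exact output shape varies by release):

import json
import subprocess
import sys

GRACE = 30  # seconds, mirroring the proposed mon_pg_inactive_timeout

raw = subprocess.check_output(
    ['ceph', 'pg', 'dump_stuck', 'inactive', str(GRACE),
     '--format', 'json']).decode()
stuck = json.loads(raw) if raw.strip() else []
if stuck:
    print("CRITICAL: %d PG(s) inactive for more than %ds" % (len(stuck), GRACE))
    sys.exit(2)
print("OK: no PGs stuck inactive")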

>> -Greg
>>
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] F21 pkgs for Ceph Hammer release ?

2015-12-01 Thread Deepak Shetty
Hi,
 Does anybody know how/where I can get the F21 repo for the ceph hammer release?

In download.ceph.com/rpm-hammer/ I only see F20 dir, not F21

F21 distro repo only carries firefly release, but I want to install Ceph
Hammer, hence the Q

thanx,
deepak
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-01 Thread HEWLETT, Paul (Paul)
I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can 
anybody confirm this?
I could not find any usage in the Ceph source code except that the value is set 
in some of the test software…

Paul


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Tom Christensen <pav...@gmail.com>
Date: Monday, 30 November 2015 at 23:20
To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

What counts as ancient?  Concurrent to our hammer upgrade we went from 
3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel we'd 
been running because we're also seeing an intermittent (its happened twice in 2 
weeks) massive load spike that completely hangs the osd node (we're talking 
about load averages that hit 20k+ before the box becomes completely 
unresponsive).  We saw a similar behavior on a 3.13 kernel, which resolved by 
moving to the 3.16 kernel we had before.  I'll try to catch one with debug_ms=1 
and see if I can see whether we're hitting a similar hang.

To your comment about omap, we do have filestore xattr use omap = true in our 
conf... which we believe was placed there by ceph-deploy (which we used to 
deploy this cluster).  We are on xfs, but we do take tons of RBD snapshots.  If 
either of these use cases will cause lots of osd map size then, we may just be 
exceeding the limits of the number of rbd snapshots ceph can handle (we take 
about 4-5000/day, 1 per RBD in the cluster)

An interesting note, we had an OSD flap earlier this morning, and when it did, 
immediately after it came back I checked its meta directory size with du -sh, 
this returned immediately, and showed a size of 107GB.  The fact that it 
returned immediately indicated to me that something had just recently read 
through that whole directory and it was all cached in the FS cache.  Normally a 
du -sh on the meta directory takes a good 5 minutes to return.  Anyway, since 
it dropped this morning its meta directory size continues to shrink and is down 
to 93GB.  So it feels like something happens that makes the OSD read all its 
historical maps which results in the OSD hanging cause there are a ton of them, 
and then it wakes up and realizes it can delete a bunch of them...
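
(One quick way to see how many maps an OSD is still holding is its admin socket; a 
minimal sketch, with the osd id below purely as an example:)

import json
import subprocess

status = json.loads(subprocess.check_output(
    ['ceph', 'daemon', 'osd.41', 'status']).decode())
print("osd.41 holds %d osdmaps (%d..%d)" % (
    status['newest_map'] - status['oldest_map'],
    status['oldest_map'], status['newest_map']))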

On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvand...@gmail.com> wrote:

The trick with debugging heartbeat problems is to grep back through the log to 
find the last thing the affected thread was doing, e.g. is 0x7f5affe72700 stuck 
in messaging, writing to the disk, reading through the omap, etc..

I agree this doesn't look to be network related, but if you want to rule it out 
you should use debug_ms=1.

Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and similarly 
started getting slow requests. To make a long story short, our issue turned out 
to be sendmsg blocking (very rarely), probably due to an ancient el6 kernel 
(these osd servers had ~800 days' uptime). The signature of this was 900s of 
slow requests, then an ms log showing "initiating reconnect". Until we got the 
kernel upgraded everywhere, we used a workaround of ms tcp read timeout = 60.
So, check your kernels, and upgrade if they're ancient. Latest el6 kernels work 
for us.
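
A minimal sketch of applying that workaround at runtime (the option name is the one 
mentioned above; pushing it via injectargs and targeting osd.* are assumptions, and 
it would also need to go into ceph.conf to survive restarts):

import subprocess

subprocess.check_call(
    ['ceph', 'tell', 'osd.*', 'injectargs', '--ms_tcp_read_timeout', '60'])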

Otherwise, those huge osd leveldb's don't look right. (Unless you're using tons 
and tons of omap...) And it kinda reminds me of the other problem we hit after 
the hammer upgrade, namely the return of the ever growing mon leveldb issue. 
The solution was to recreate the mons one by one. Perhaps you've hit something 
similar with the OSDs. debug_osd=10 might be good enough to see what the osd is 
doing, maybe you need debug_filestore=10 also. If that doesn't show the 
problem, bump those up to 20.

Good luck,

Dan

On 30 Nov 2015 20:56, "Tom Christensen" <pav...@gmail.com> wrote:
>
> We recently upgraded to 0.94.3 from firefly and now for the last week have 
> had intermittent slow requests and flapping OSDs.  We have been unable to 
> nail down the cause, but its feeling like it may be related to our osdmaps 
> not getting deleted properly.  Most of our osds are now storing over 100GB of 
> data in the meta directory, almost all of that is historical osd maps going 
> back over 7 days old.
>
> We did do a small cluster change (We added 35 OSDs to a 1445 OSD cluster), 
> the rebalance took about 36 hours, and it completed 10 days ago.  Since that 
> time the cluster has been HEALTH_OK and all pgs have been active+clean except 
> for when we have an OSD flap.
>
> When the OSDs flap they do not crash and restart, they just go unresponsive 
> for 1-3 minutes, and then come back alive all on their own.  They get marked 
> down by peers, and cause some peering and then they just come back rejoin the 
> cluster and continue on their merry way.
>
> We see a bunch of this in the logs while the OSD is catatonic:
>
> Nov 30 11:2

Re: [ceph-users] State of nfs-ganesha CEPH fsal

2015-12-01 Thread Dennis Kramer (DT)
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

I've been testing the options below, but I still have the same problem:
files are not visible on different clients. After a "touch" of a
new file (or directory) all files are visible again. It definitely
looks like a directory cache problem.

Client mount options like "noac" or "actimeo=0" solved it for some,
but after a while the clients ran into the same problem again. I'm a
bit at a loss here, so hopefully someone can shed some more light on
this annoying problem.

It seems that when I restart the NFS server, the problem disappears for a
while. After a week or so, the problem resurfaces.

I've used the following options for the NFS-Ganesha config:
NFSv4
{
DomainName = "<>";
IdmapConf = "/etc/idmapd.conf";
}
NFS_KRB5
{
Active_krb5 = false;
}

NFS_DupReq_Hash
{
Index_Size = 17 ;
Alphabet_Length = 10 ;
}

NFSv4_ClientId_Cache
{
Index_Size = 17 ;
Alphabet_Length = 10 ;
}

CEPH
{
}

CacheInode_Client
{
Entry_Prealloc_PoolSize = 1000 ;
Attr_Expiration_Time = Immediate ;
Symlink_Expiration_Time = Immediate ;
Directory_Expiration_Time = Immediate ;
Use_Test_Access = 1 ;
}

CacheInode
{
Attr_Expiration_Time = 0 ;
Use_Getattr_Directory_Invalidation = true;
}

EXPORT_DEFAULTS
{
Disable_ACL = FALSE;
SecType = "sys";
Protocols = "4";
Transports = "TCP";
Manage_Gids = TRUE;
}

EXPORT
{
Export_ID=1;
FSAL {
Name = Ceph;
}
Path = "/DATA/SHARE";
Pseudo = "/DATA";
Tag = "DATA";
CLIENT {
Clients = 172.17.0.0/16;
Access_Type = RW;
Squash = Root;
}
}


With regards,


On 10/28/2015 05:37 PM, Lincoln Bryant wrote:
> Hi Dennis,
> 
> We're using NFS Ganesha here as well. I can send you my
> configuration which is working but we squash users and groups down
> to a particular uid/gid, so it may not be super helpful for you.
> 
> I think files not being immediately visible is working as intended,
> due to directory caching. I _believe_ what you need to do is set
> the following (comments shamelessly stolen from the Gluster FSAL): 
> # If thuis flag is set to yes, a getattr is performed each time a
> readdir is done # if mtime do not match, the directory is renewed.
> This will make the cache more # synchronous to the FSAL, but will
> strongly decrease the directory cache performance 
> Use_Getattr_Directory_Invalidation = true;
> 
> Hope that helps.
> 
> Thanks, Lincoln
> 
>> On Oct 28, 2015, at 9:08 AM, Dennis Kramer (DT)
>>  wrote:
>> 
> Sorry for raising this topic from the dead, but i'm having the
> same issues with NFS-GANESHA /w the wrong user/group information.
> 
> Do you maybe have a working ganesha.conf? I'm assuming I might 
> mis-configured something in this file. It's also nice to have some 
> reference config file from a working FSAL CEPH, the sample config
> is very minimalistic.
> 
> I also have another issue with files that are not immediately
> visible in a NFS folder after another system (using the same NFS)
> has created it. There seems to be a slight delay before all system
> have the same directory listing. This can be enforced by creating a
> *new* file in this directory which will cause a refresh on this
> folder. Changing directories also helps on affected system(s).
> 
> On 07/28/2015 11:30 AM, Haomai Wang wrote:
 On Tue, Jul 28, 2015 at 5:28 PM, Burkhard Linke 
  wrote:
> Hi,
> 
> On 07/28/2015 11:08 AM, Haomai Wang wrote:
>> 
>> On Tue, Jul 28, 2015 at 4:47 PM, Gregory Farnum 
>>  wrote:
>>> 
>>> On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke 
>>> 
>>> wrote:
> 
> 
> *snipsnap*
 
 Can you give some details on that issues? I'm
 currently looking for a way to provide NFS based
 access to CephFS to our desktop machines.
>>> 
>>> Ummm...sadly I can't; we don't appear to have any
>>> tracker tickets and I'm not sure where the report went
>>> to. :( I think it was from Haomai...
>> 
>> My fault, I should report this to ticket.
>> 
>> I have forgotten the details about the problem, I submit
>> the infos to IRC :-(
>> 
>> It related to the "ls" output. It will print the wrong 
>> user/group owner as "-1", maybe related to root squash?
> 
> Are you sure this problem is related to the CephFS FSAL? I
> also had a hard time setting up ganesha correctly,
> especially with respect to user and group mappings,
> especially with a kerberized setup.
> 
> I'm currently running a small test setup with one server
> and one client to single out the last kerberos related
> problems (nfs-ganesha 2.2.0 / Ceph Hammer 0.94.2 / Ubuntu
> 14.04). User/group listings have been OK so far. Do you
> remember whether the problem occ

Re: [ceph-users] rbd_inst.create

2015-12-01 Thread NEVEU Stephane
Ok thank you Jason.

[@@ THALES GROUP INTERNAL @@]

-Original Message-
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: Monday, 30 November 2015 15:38
To: NEVEU Stephane
Cc: Ceph Users; Gregory Farnum
Subject: Re: [ceph-users] rbd_inst.create

... and once you create a pool-level snapshot on a pool, there is no way to 
convert that pool back to being compatible with RBD self-managed snapshots.

As for the RBD image feature bits, they are defined within rbd.py.  On master, 
they currently are as follows:

RBD_FEATURE_LAYERING = 1
RBD_FEATURE_STRIPINGV2 = 2
RBD_FEATURE_EXCLUSIVE_LOCK = 4
RBD_FEATURE_OBJECT_MAP = 8
RBD_FEATURE_FAST_DIFF = 16
RBD_FEATURE_DEEP_FLATTEN = 32
RBD_FEATURE_JOURNALING = 64
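
For reference, a minimal sketch of combining those bits through the python-rbd 
bindings (pool name, image name and size are placeholders; the bit values are simply 
restated from the list above):

import rados
import rbd

RBD_FEATURE_LAYERING = 1
RBD_FEATURE_EXCLUSIVE_LOCK = 4

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')          # placeholder pool
try:
    features = RBD_FEATURE_LAYERING | RBD_FEATURE_EXCLUSIVE_LOCK
    rbd.RBD().create(ioctx, 'myimage', 4 * 1024 ** 3,
                     old_format=False, features=features)
finally:
    ioctx.close()
    cluster.shutdown()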

-- 

Jason Dillaman 


- Original Message - 

> From: "Gregory Farnum" 
> To: "NEVEU Stephane" 
> Cc: "Ceph Users" 
> Sent: Monday, November 30, 2015 8:17:17 AM
> Subject: Re: [ceph-users] rbd_inst.create

> On Nov 27, 2015 3:34 AM, "NEVEU Stephane" < 
> stephane.ne...@thalesgroup.com >
> wrote:
> >
> > Ok, I think I got it. It seems to come from here :
> >
> > tracker.ceph.com/issues/6047
> >
> >
> >
> > I’m trying to snapshot an image while I previously made a snapshot 
> > of my pool… whereas it just works fine when using a brand new pool. 
> > I’m using ceph v0.80.10 on Ubuntu 14.04. As I see, it has been 
> > patched since dumpling. Could it be a regression ?
> Pool snapshots and the "self-managed" snapshots used by rbd are incompatible.
> You have to pick one or the other on each pool.
> >
> >
> >
> >
> >
> >
> >
> > De : ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] De la 
> > part de NEVEU Stephane Envoyé : jeudi 26 novembre 2015 15:49 À : 
> > ceph-users@lists.ceph.com Objet : [ceph-users] rbd_inst.create
> >
> >
> >
> > Hi all,
> >
> >
> >
> > I’m using python scripts to create rbd images like described here 
> > http://docs.ceph.com/docs/giant/rbd/librbdpy/
> >
> > rbd_inst.create(ioctx, 'myimage', size, old_format=False, 
> > features=1) seems to create a layering image
> >
> > rbd_inst.create(ioctx, 'myimage', size, old_format=False, 
> > features=2) seems to create a striped image
> >
> >
> >
> > Setting up “rbd default format =2” in ceph.conf and just using the 
> > following (without feature=x)
> >
> > rbd_inst.create(ioctx, 'myimage', size) seems to create a layered + 
> > striped image
> >
> >
> >
> > If someone could point me to the documentation about those bitmasks 
> > (features), that would be great J I cannot find it.
> >
> >
> >
> > Moreover, when my images are created this way (using rbd_inst.create 
> > with python), no way to snapshot an image !
> >
> > #rbd snap create rbd/myimage@snap1
> >
> > …. -1 librbd: failed to create snap id: (22) Invalid argument
> >
> >
> >
> > Same thing with img.create_snap(snap) in python, snapshots are not created.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > [@@ THALES GROUP INTERNAL @@]
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High 0.94.5 OSD memory use at 8GB RAM/TB raw disk during recovery

2015-12-01 Thread Laurent GUERBY
On Mon, 2015-11-30 at 18:58 -0600, Mark Nelson wrote:
> Hi Laurent,
> 
> Wow, that's excessive!  I'd see if anyone else has any tricks first, but 
> if nothing else helps, running an OSD through valgrind with massif will 
> probably help pinpoint what's going on.  Have you tweaked the recovery 
> tunables at all?
> Oh, forgot to ask, any core dumps?
> Mark

Hi Mark,

The only options we've touched are:
"osd_max_backfills": "1",
"osd_recovery_max_active": "1",
"osd_recovery_op_priority": "1",
Plus all the "noxxx" of ceph -s below.

Do you have in mind other options we could tweak?

We have no core dump yet, Mehdi is trying a heap dump on some OSD.

When looking at "ceph tell osd.X heap stats" on OSD there is nothing in
freelist all "in use", here is an OSD with 19G RAM for 2 TB disk:

25118 root  20   0 19.030g 0.014t   2568 S  38.2 46.4
8:43.55 /usr/bin/ceph-osd --cluster=ceph -i 2
-f  
   
/dev/sdb1   1.9T  1.6T  271G  86% /var/lib/ceph/osd/ceph-2

root@g3:~# ceph tell osd.2 heap stats
osd.2 tcmalloc heap
stats:
MALLOC:18568498424 (17708.3 MiB) Bytes in use by application
MALLOC: +189464576 (  180.7 MiB) Bytes in page heap freelist
MALLOC: +210782296 (  201.0 MiB) Bytes in central cache freelist
MALLOC: +  4416048 (4.2 MiB) Bytes in transfer cache freelist
MALLOC: + 29157504 (   27.8 MiB) Bytes in thread cache freelists
MALLOC: + 80457888 (   76.7 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =  19082776736 (18198.8 MiB) Actual memory used (physical +
swap)
MALLOC: +   393216 (0.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =  19083169952 (18199.1 MiB) Virtual address space used
MALLOC:
MALLOC:1150671  Spans in use
MALLOC:342  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via
madvise()).
Bytes released to the OS take up virtual address space but no physical
memory.
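
(The ReleaseFreeMemory() hint above maps to the tcmalloc "heap release" admin 
command; a minimal sketch, reusing osd.2 from the output above:)

import subprocess

subprocess.check_call(['ceph', 'tell', 'osd.2', 'heap', 'release'])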

Log since latest restart of the 19G OSD below with ceph -s and ceph osd
tree (it's rarely below two OSDs down; it oscillates between 2 and 10 down).

Sincerely,

Laurent

2015-12-01 08:32:13.928973 7f0554016900  0 ceph version 0.94.5-164-gbf9e1b6 
(bf9e1b6307692fdc50465a64590d83e3d7015c9d), process ceph-osd, pid 25118
2015-12-01 08:32:13.939219 7f0554016900  0 filestore(/var/lib/ceph/osd/ceph-2) 
backend xfs (magic 0x58465342)
2015-12-01 08:32:14.071146 7f0554016900  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl 
is supported and appears to work
2015-12-01 08:32:14.071163 7f0554016900  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl 
is disabled via 'filestore fiemap' config option
2015-12-01 08:32:14.186866 7f0554016900  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: syncfs(2) 
syscall fully supported (by glibc and kernel)
2015-12-01 08:32:14.186996 7f0554016900  0 
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature: extsize is 
supported and kernel 3.19.0-32-generic >= 3.5
2015-12-01 08:32:15.649088 7f0554016900  0 filestore(/var/lib/ceph/osd/ceph-2) 
mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-12-01 08:32:29.476440 7f0554016900  1 journal _open 
/var/lib/ceph/osd/ceph-2/journal fd 19: 5367660544 bytes, block size 4096 
bytes, directio = 1, aio = 1
2015-12-01 08:32:29.540794 7f0554016900  1 journal _open 
/var/lib/ceph/osd/ceph-2/journal fd 19: 5367660544 bytes, block size 4096 
bytes, directio = 1, aio = 1
2015-12-01 08:32:29.597359 7f0554016900  0  cls/hello/cls_hello.cc:271: 
loading cls_hello
2015-12-01 08:32:29.623116 7f0554016900  0 osd.2 214588 crush map has features 
2303210029056, adjusting msgr requires for clients
2015-12-01 08:32:29.623132 7f0554016900  0 osd.2 214588 crush map has features 
2578087936000 was 8705, adjusting msgr requires for mons
2015-12-01 08:32:29.623138 7f0554016900  0 osd.2 214588 crush map has features 
2578087936000, adjusting msgr requires for osds
2015-12-01 08:32:29.623158 7f0554016900  0 osd.2 214588 load_pgs
2015-12-01 08:34:44.563581 7f0554016900  0 osd.2 214588 load_pgs opened 1496 pgs
2015-12-01 08:34:44.564545 7f0546afe700  0 -- 192.168.99.253:6811/25118 >> :/0 
pipe(0xad4e000 sd=27 :6811 s=0 pgs=0 cs=0 l=0 c=0xb673).accept failed to 
getpeername (107) Transport endpoint is not connected
2015-12-01 08:34:44.564600 7f05466bf700  0 -- 192.168.99.253:6811/25118 >> :/0 
pipe(0x893c000 sd=32 :6811 s=0 pgs=0 cs=0 l=0 c=0xb6730420).accept failed to 
getpeername (107) Transport endpoint is not connected
2015-12-01 08:34:44.577168 7f0554016900 -1 osd.2 214588 log_to_monitors 
{default=true}
2015-12-01 08:34:44.579147 7f052b4ff700

Re: [ceph-users] Flapping OSDs, Large meta directories in OSDs

2015-12-01 Thread Dan van der Ster
On Tue, Dec 1, 2015 at 12:20 AM, Tom Christensen  wrote:
> What counts as ancient?  Concurrent to our hammer upgrade we went from
> 3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel
> we'd been running because we're also seeing an intermittent (its happened
> twice in 2 weeks) massive load spike that completely hangs the osd node
> (we're talking about load averages that hit 20k+ before the box becomes
> completely unresponsive).  We saw a similar behavior on a 3.13 kernel, which
> resolved by moving to the 3.16 kernel we had before.  I'll try to catch one
> with debug_ms=1 and see if I can see it we're hitting a similar hang.

In our case we had upgraded from  2.6.32-358.14.1.el6.x86_64 to
2.6.32-573.8.1.el6.

> To your comment about omap, we do have filestore xattr use omap = true in
> our conf... which we believe was placed there by ceph-deploy (which we used
> to deploy this cluster).  We are on xfs, but we do take tons of RBD
> snapshots.  If either of these use cases will cause lots of osd map size
> then, we may just be exceeding the limits of the number of rbd snapshots
> ceph can handle (we take about 4-5000/day, 1 per RBD in the cluster)

That sure sounds like a lot of snapshots, but I don't know if it would
cause a problem.

> An interesting note, we had an OSD flap earlier this morning, and when it
> did, immediately after it came back I checked its meta directory size with
> du -sh, this returned immediately, and showed a size of 107GB.  The fact
> that it returned immediately indicated to me that something had just
> recently read through that whole directory and it was all cached in the FS
> cache.  Normally a du -sh on the meta directory takes a good 5 minutes to
> return.  Anyway, since it dropped this morning its meta directory size
> continues to shrink and is down to 93GB.  So it feels like something happens
> that makes the OSD read all its historical maps which results in the OSD
> hanging cause there are a ton of them, and then it wakes up and realizes it
> can delete a bunch of them...

If all OSDs are up and all PGs are active+clean I don't know what
would cause an OSD to need old maps. The debug logs should help.

Another tool to use is perf top. With that you can see if the OSD is
busy in some leveldb operation, e.g. compression, or something else...

Cheers, Dan

> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster 
> wrote:
>>
>> The trick with debugging heartbeat problems is to grep back through the
>> log to find the last thing the affected thread was doing, e.g. is
>> 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the
>> omap, etc..
>>
>> I agree this doesn't look to be network related, but if you want to rule
>> it out you should use debug_ms=1.
>>
>> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
>> similarly started getting slow requests. To make a long story short, our
>> issue turned out to be sendmsg blocking (very rarely), probably due to an
>> ancient el6 kernel (these osd servers had ~800 days' uptime). The signature
>> of this was 900s of slow requests, then an ms log showing "initiating
>> reconnect". Until we got the kernel upgraded everywhere, we used a
>> workaround of ms tcp read timeout = 60.
>> So, check your kernels, and upgrade if they're ancient. Latest el6 kernels
>> work for us.
>>
>> Otherwise, those huge osd leveldb's don't look right. (Unless you're using
>> tons and tons of omap...) And it kinda reminds me of the other problem we
>> hit after the hammer upgrade, namely the return of the ever growing mon
>> leveldb issue. The solution was to recreate the mons one by one. Perhaps
>> you've hit something similar with the OSDs. debug_osd=10 might be good
>> enough to see what the osd is doing, maybe you need debug_filestore=10 also.
>> If that doesn't show the problem, bump those up to 20.
>>
>> Good luck,
>>
>> Dan
>>
>> On 30 Nov 2015 20:56, "Tom Christensen"  wrote:
>> >
>> > We recently upgraded to 0.94.3 from firefly and now for the last week
>> > have had intermittent slow requests and flapping OSDs.  We have been unable
>> > to nail down the cause, but its feeling like it may be related to our
>> > osdmaps not getting deleted properly.  Most of our osds are now storing 
>> > over
>> > 100GB of data in the meta directory, almost all of that is historical osd
>> > maps going back over 7 days old.
>> >
>> > We did do a small cluster change (We added 35 OSDs to a 1445 OSD
>> > cluster), the rebalance took about 36 hours, and it completed 10 days ago.
>> > Since that time the cluster has been HEALTH_OK and all pgs have been
>> > active+clean except for when we have an OSD flap.
>> >
>> > When the OSDs flap they do not crash and restart, they just go
>> > unresponsive for 1-3 minutes, and then come back alive all on their own.
>> > They get marked down by peers, and cause some peering and then they just
>> > come back rejoin the cluster and continue on their merry way.
>> >
>> > We see