Re: [ceph-users] Pgs stuck on undersized+degraded+peered

2016-12-09 Thread Christian Wuerdig
Hi,

it's useful to generally provide some detail around the setup, like:
What are your pool settings - size and min_size?
What is your failure domain - osd or host?
What version of ceph are you running on which OS?

You can check which specific PGs are problematic by running "ceph health
detail" and then you can use "ceph pg x.y query" (where x.y is a
problematic PG identified from ceph health).
http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-pg/
might provide you some pointers.
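
For reference, a minimal sketch of those checks ("2.1f" is just a placeholder
PG id, use one reported by ceph health detail):

ceph health detail | grep -E 'undersized|degraded|peered'
ceph pg 2.1f query | less        # inspect "up", "acting" and "recovery_state"
ceph osd tree                    # confirm which host/OSDs are actually down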

One obvious fix would be to get your 3rd osd server up and running again -
but I guess you're already working on this.

Cheers
Christian

On Sat, Dec 10, 2016 at 7:25 AM, fridifree  wrote:

> Hi,
> 1 of 3 of my osd servers is down and I get this error
> And I do not have any access to rbds on the cluster
>
> Any suggestions?
>
> Thank you
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS FAILED assert(dn->get_linkage()->is_null())

2016-12-09 Thread Chris Sarginson
Hi Goncalo,

In the end we ascertained that the assert was coming from reading corrupt
data in the mds journal.  We have followed the sections at the following
link (http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/) in order
down to (and including) MDS Table wipes (only wiping the "session" table in
the final step).  This resolved the problem we had with our mds asserting.

We have also run a cephfs scrub to validate the data (ceph daemon mds.0
scrub_path / recursive repair), which has resulted in a "metadata damage
detected" health warning.  This seems to perform a read of all objects
involved in the cephfs rados pools (anecdotally, the scan of the data pool
was much faster than the scan of the metadata pool).

We are now working with the output of "ceph tell mds.0 damage ls", and
looking at the following mailing list post as a starting point for
proceeding with that:
http://ceph-users.ceph.narkive.com/EfFTUPyP/how-to-fix-the-mds-damaged-issue
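
For anyone following along, a minimal sketch of the scrub and damage-table
commands involved, assuming a single active MDS addressed as mds.0; the
"damage rm" subcommand is from memory and worth confirming with
"ceph tell mds.0 help" on your build:

ceph daemon mds.0 scrub_path / recursive repair   # online scrub/repair of the tree
ceph tell mds.0 damage ls                         # list entries in the damage table
ceph tell mds.0 damage rm <id>                    # clear an entry once it is repaired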

Chris

On Fri, 9 Dec 2016 at 19:26 Goncalo Borges 
wrote:

> Hi Sean, Rob.
>
> I saw on the tracker that you were able to resolve the mds assert by
> manually cleaning the corrupted metadata. Since I am also hitting that
> issue and I suspect that i will face an mds assert of the same type sooner
> or later, can you please explain a bit further what operations did you do
> to clean the problem?
> Cheers
> Goncalo
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Rob
> Pickerill [r.picker...@gmail.com]
> Sent: 09 December 2016 07:13
> To: Sean Redmond; John Spray
> Cc: ceph-users
> Subject: Re: [ceph-users] CephFS FAILED
> assert(dn->get_linkage()->is_null())
>
> Hi John / All
>
> Thank you for the help so far.
>
> To add a further point to Sean's previous email, I see this log entry
> before the assertion failure:
>
> -6> 2016-12-08 15:47:08.483700 7fb133dca700 12
> mds.0.cache.dir(1000a453344) remove_dentry [dentry
> #100/stray9/1000a453344/config [2,head] auth NULL (dversion lock) v=540
> inode=0 0x55e8664fede0]
> -5> 2016-12-08 15:47:08.484882 7fb133dca700 -1 mds/CDir.cc: In
> function 'void CDir::try_remove_dentries_for_stray()' thread 7fb133dca700
> time 2016-12-08 15:47:08.483704
> mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
>
> And I can reference this with:
>
> root@ceph-mon1:~/1000a453344# rados -p ven-ceph-metadata-1 listomapkeys
> 1000a453344.
> 1470734502_head
> config_head
>
> Would we also need to clean up this object, and if so, is there a safe way
> we can do this?
>
> Rob
>
> On Thu, 8 Dec 2016 at 19:58 Sean Redmond > wrote:
> Hi John,
>
> Thanks for your pointers, I have extracted the omap keys and omap values
> for an object I found in the metadata pool called '600.' and
> dropped them at the below location
>
> https://www.dropbox.com/sh/wg6irrjg7kie95p/AABk38IB4PXsn2yINpNa9Js5a?dl=0
>
> Could you explain how it is possible to identify stray directory fragments?
>
> Thanks
>
> On Thu, Dec 8, 2016 at 6:30 PM, John Spray > wrote:
> On Thu, Dec 8, 2016 at 3:45 PM, Sean Redmond  > wrote:
> > Hi,
> >
> > We had no changes going on with the ceph pools or ceph servers at the
> time.
> >
> > We have however been hitting this in the last week and it maybe related:
> >
> > http://tracker.ceph.com/issues/17177
>
> Oh, okay -- so you've got corruption in your metadata pool as a result
> of hitting that issue, presumably.
>
> I think in the past people have managed to get past this by taking
> their MDSs offline and manually removing the omap entries in their
> stray directory fragments (i.e. using the `rados` cli on the objects
> starting "600.").
>
> John
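
(A minimal sketch of that omap cleanup, assuming the MDS is stopped and the
metadata pool is named "metadata". The stray directories of rank 0 are inodes
0x600 through 0x609, so their dirfrag objects are 600.00000000 through
609.00000000; <key> is a placeholder for a corrupt dentry key, and it is worth
saving the keys/values somewhere before removing anything:

rados -p metadata listomapkeys 600.00000000        # dump dentry keys of one stray dirfrag
rados -p metadata getomapval 600.00000000 <key>    # inspect a suspect entry
rados -p metadata rmomapkey 600.00000000 <key>     # remove the corrupt dentry entry
)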
>
>
>
> > Thanks
> >
> > On Thu, Dec 8, 2016 at 3:34 PM, John Spray > wrote:
> >>
> >> On Thu, Dec 8, 2016 at 3:11 PM, Sean Redmond  >
> >> wrote:
> >> > Hi,
> >> >
> >> > I have a CephFS cluster that is currently unable to start the mds
> server
> >> > as
> >> > it is hitting an assert, the extract from the mds log is below, any
> >> > pointers
> >> > are welcome:
> >> >
> >> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >> >
> >> > 2016-12-08 14:50:18.577038 7f7d9faa3700  1 mds.0.47077 handle_mds_map
> >> > state
> >> > change up:rejoin --> up:active
> >> > 2016-12-08 14:50:18.577048 7f7d9faa3700  1 mds.0.47077 recovery_done
> --
> >> > successful recovery!
> >> > 2016-12-08 14:50:18.577166 7f7d9faa3700  1 mds.0.47077 active_start
> >> > 2016-12-08 14:50:19.460208 7f7d9faa3700  1 mds.0.47077 cluster
> >> > recovered.
> >> > 2016-12-08 14:50:19.495685 7f7d9abfc700 -1 mds/CDir.cc: In function
> >> > 'void
> >> > CDir::try_remove_dentries_for_stray()' thread 7f7d9abfc700 time
> >> > 2016-12-08
> 

Re: [ceph-users] High load on OSD processes

2016-12-09 Thread Reed Dier
I don’t think there is a graceful path to downgrade.

There is a hot fix upstream I believe. My understanding is the build is being 
tested for release.

Francois Lafont posted in the other thread:

> Begin forwarded message:
> 
> From: Francois Lafont 
> Subject: Re: [ceph-users] 10.2.4 Jewel released
> Date: December 9, 2016 at 11:54:06 AM CST
> To: "ceph-users@lists.ceph.com" 
> Content-Type: text/plain; charset="us-ascii"
> 
> On 12/09/2016 06:39 PM, Alex Evonosky wrote:
> 
>> Sounds great.  May I asked what procedure you did to upgrade?
> 
> Of course. ;)
> 
> It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
> (I think this link was pointed by Greg Farnum or Sage Weil in a
> previous message).
> 
> Personally I use Ubuntu Trusty, so for me in the page above leads me
> to use this line in my "sources.list":
> 
> deb http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/ trusty main
> 
> And after that "apt-get update && apt-get upgrade" etc.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
This is obviously geared towards Ubuntu/Debian, though I’d assume there’s an RPM
of the same build accessible.

Reed

> On Dec 9, 2016, at 4:43 PM, lewis.geo...@innoscale.net wrote:
> 
> Hi Reed,
> Yes, this was just installed yesterday and that is the version. I just 
> retested and it is exactly 15 minutes when the load starts to climb. 
>  
> So, just like Diego, do you know if there is a fix for this yet and when it 
> might be available on the repo? Should I try to install the prior minor 
> release version for now?
>  
> Thank you for the information.
>  
> Have a good day,
>  
> Lewis George
>  
>  
>  
> From: "Diego Castro" 
> Sent: Friday, December 9, 2016 2:26 PM
> To: "Reed Dier" 
> Cc: lewis.geo...@innoscale.net, ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] High load on OSD processes
>  
> Same here, is there any ETA to publish CentOS packages?
>  
>  
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>  
> 2016-12-09 18:59 GMT-03:00 Reed Dier  >:
> Assuming you deployed within the last 48 hours, I’m going to bet you are 
> using v10.2.4 which has an issue that causes high cpu utilization.
>  
> Should see large ramp up in loadav after 15 minutes exactly.
>  
> See mailing list thread here: 
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg34390.html 
> 
>  
> Reed
>  
>  
>> On Dec 9, 2016, at 3:25 PM, lewis.geo...@innoscale.net 
>>  wrote:
>> Hello,
>> I am testing out a new node setup for us and I have configured a node in a 
>> single node cluster. It has 24 OSDs. Everything looked okay during the 
>> initial build and I was able to run the 'rados bench' on it just fine. 
>> However, if I just let the cluster sit and run for a few minutes without 
>> anything happening, the load starts to go up quickly. Each OSD device ends 
>> up using 130% CPU, with the load on the box hitting 550.00. No operations 
>> are going on, nothing shows up in the logs as happening or wrong. If I 
>> restart the OSD processes, the load stays down for a few minutes(almost at 
>> nothing) and then just jumps back up again.
>>  
>> Any idea what could cause this or a direction I can look to check it?
>>  
>> Have a good day,
>>  
>> Lewis George
>>  
>>  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High load on OSD processes

2016-12-09 Thread lewis.geo...@innoscale.net
Hi Reed,
 Yes, this was just installed yesterday and that is the version. I just 
retested and it is exactly 15 minutes when the load starts to climb.

 So, just like Diego, do you know if there is a fix for this yet and when it 
might be available on the repo? Should I try to install the prior minor release 
version for now?

 Thank you for the information.

 Have a good day,

 Lewis George





 From: "Diego Castro" 
Sent: Friday, December 9, 2016 2:26 PM
To: "Reed Dier" 
Cc: lewis.geo...@innoscale.net, ceph-users@lists.ceph.com
Subject: Re: [ceph-users] High load on OSD processes
 Same here, is there any ETA to publish CentOS packages?

 ---
 Diego Castro / The CloudFather GetupCloud.com - Eliminamos a Gravidade

   2016-12-09 18:59 GMT-03:00 Reed Dier :
Assuming you deployed within the last 48 hours, I'm going to bet you are using
v10.2.4 which has an issue that causes high cpu utilization.

 Should see large ramp up in loadav after 15 minutes exactly.

 See mailing list thread here: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg34390.html

 Reed

   On Dec 9, 2016, at 3:25 PM, lewis.geo...@innoscale.net wrote:

Hello,
 I am testing out a new node setup for us and I have configured a node in a 
single node cluster. It has 24 OSDs. Everything looked okay during the initial 
build and I was able to run the 'rados bench' on it just fine. However, if I 
just let the cluster sit and run for a few minutes without anything happening, 
the load starts to go up quickly. Each OSD device ends up using 130% CPU, with 
the load on the box hitting 550.00. No operations are going on, nothing shows 
up in the logs as happening or wrong. If I restart the OSD processes, the load 
stays down for a few minutes(almost at nothing) and then just jumps back up 
again.

 Any idea what could cause this or a direction I can look to check it?

 Have a good day,

 Lewis George






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High load on OSD processes

2016-12-09 Thread Diego Castro
Same here, is there any ETA to publish CentOS packages?


---
Diego Castro / The CloudFather
GetupCloud.com - Eliminamos a Gravidade

2016-12-09 18:59 GMT-03:00 Reed Dier :

> Assuming you deployed within the last 48 hours, I’m going to bet you are
> using v10.2.4 which has an issue that causes high cpu utilization.
>
> Should see large ramp up in loadav after 15 minutes exactly.
>
> See mailing list thread here: https://www.mail-
> archive.com/ceph-users@lists.ceph.com/msg34390.html
>
> Reed
>
>
> On Dec 9, 2016, at 3:25 PM, lewis.geo...@innoscale.net wrote:
>
> Hello,
> I am testing out a new node setup for us and I have configured a node in a
> single node cluster. It has 24 OSDs. Everything looked okay during the
> initial build and I was able to run the 'rados bench' on it just fine.
> However, if I just let the cluster sit and run for a few minutes without
> anything happening, the load starts to go up quickly. Each OSD device ends
> up using 130% CPU, with the load on the box hitting 550.00. No operations
> are going on, nothing shows up in the logs as happening or wrong. If I
> restart the OSD processes, the load stays down for a few minutes(almost at
> nothing) and then just jumps back up again.
>
> Any idea what could cause this or a direction I can look to check it?
>
> Have a good day,
>
> Lewis George
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replication: A BIG warning

2016-12-09 Thread David Turner
I'm pretty certain that the write returns as complete only after all active 
OSDs for a PG have completed the write regardless of min_size.
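
For completeness, a minimal sketch of checking and changing this, assuming a
pool named "rbd":

ceph osd pool get rbd size        # replica count
ceph osd pool get rbd min_size    # replicas required before I/O is accepted
ceph osd pool set rbd min_size 2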



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Oliver 
Humpage [oli...@watershed.co.uk]
Sent: Friday, December 09, 2016 2:31 PM
To: ceph-us...@ceph.com
Subject: Re: [ceph-users] 2x replication: A BIG warning


On 7 Dec 2016, at 15:01, Wido den Hollander 
> wrote:

I would always run with min_size = 2 and manually switch to min_size = 1 if the 
situation really requires it at that moment.

Thanks for this thread, it’s been really useful.

I might have misunderstood, but does min_size=2 also mean that writes have to 
wait for at least 2 OSDs to have data written before the write is confirmed? I 
always assumed this would have a noticeable effect on performance and so left 
it at 1.

Our use case is RBDs being exported as iSCSI for ESXi. OSDs are journalled on 
enterprise SSDs, servers are linked with 10Gb, and we’re generally getting very 
acceptable speeds. Any idea as to how upping min_size to 2 might affect things, 
or should we just try it and see?

Oliver.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High load on OSD processes

2016-12-09 Thread Reed Dier
Assuming you deployed within the last 48 hours, I’m going to bet you are using 
v10.2.4 which has an issue that causes high cpu utilization.

Should see a large ramp up in loadavg after 15 minutes exactly.

See mailing list thread here: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg34390.html 


Reed


> On Dec 9, 2016, at 3:25 PM, lewis.geo...@innoscale.net wrote:
> 
> Hello,
> I am testing out a new node setup for us and I have configured a node in a 
> single node cluster. It has 24 OSDs. Everything looked okay during the 
> initial build and I was able to run the 'rados bench' on it just fine. 
> However, if I just let the cluster sit and run for a few minutes without 
> anything happening, the load starts to go up quickly. Each OSD device ends up 
> using 130% CPU, with the load on the box hitting 550.00. No operations are 
> going on, nothing shows up in the logs as happening or wrong. If I restart 
> the OSD processes, the load stays down for a few minutes(almost at nothing) 
> and then just jumps back up again.
>  
> Any idea what could cause this or a direction I can look to check it?
>  
> Have a good day,
>  
> Lewis George
>  
>  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore_split_multiple hardcoded maximum?

2016-12-09 Thread Dan van der Ster
Coincidentally, we've been suffering from split-induced slow requests on
one of our clusters for the past week.

I wanted to add that it isn't at all obvious when slow requests are being
caused by filestore splitting. (When you increase the filestore/osd logs to
10, probably also 20, all you see is that an object write is taking >30s,
which seems totally absurd.) So only after a lot of head scratching I
noticed this thread and realized it could be the splitting -- sure enough,
our PGs were crossing the 5120 object threshold, one-by-one at a rate of
around 5-10 PGs per hour.

I've just sent this PR for comments:

   https://github.com/ceph/ceph/pull/12421

IMHO, this (or something similar) would help operators a bunch in
identifying when this is happening.
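
In the meantime, a crude way to see which PGs are getting close to the
threshold is to count objects per PG directory on one OSD; a minimal sketch,
assuming a filestore OSD with data under /var/lib/ceph/osd/ceph-0:

cd /var/lib/ceph/osd/ceph-0/current
for d in *_head; do
    printf '%s %s\n' "$d" "$(find "$d" -type f | wc -l)"
done | sort -k2 -n | tail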

Thanks!

Dan



On Fri, Dec 9, 2016 at 7:27 PM, David Turner 
wrote:

> Our 32k PGs each have about 25-30k objects (25-30GB per PG).  When we
> first contracted with Redhat support, they recommended for us to have our
> setting at about 4000 files per directory before splitting into
> subfolders.  When we split into subfolders with that setting, an
> osd_heartbeat_grace (how long before an OSD can't be reached before
> reporting it down to the MONs) of 60 was needed to not flap OSDs during
> subfolder splitting.
>
> With the plan to go back and lower the setting again, we would increase
> that setting to make it through a holiday weekend or a time where we needed
> to have higher performance.  When we went to lower it, it was too painful
> to get through and now we're at what looks like a hardcoded maximum of
> 12,800 objects per subfolder before a split is forced.  At the amount of
> objects now, we have to use an osd_heartbeat_grace of 240 to avoid flapping
> OSDs during subfolder splitting.
>
> Unless you NEED to merge your subfolders, you can set your filestore merge
> threshold to a negative number and it will never merge.  The equation for
> knowing when to split further takes the absolute value of the merge
> threshold so you can just invert it to a negative number and not change the
> behavior of splitting while disabling merging.
>
> The OSDs flapping is unrelated to the 10.2.3 bug.  We're currently on
> 0.94.7 and have had this problem since Firefly.  The flapping is due to the
> OSD being so involved in the process to split the subfolder that it isn't
> responding to other requests, that's why using osd_heartbeat_grace gets us
> through the splitting.
>
> 1) We do not have SELinux installed on our Ubuntu servers.
>
> 2) We monitor and manage our fragmentation and haven't seen much of an
> issue since we increased our alloc_size in the mount options for XFS.
>
> "5) pre-splitting PGs is I think the right answer."  Pre-splitting PGs is
> counter-intuitive.  It's a good theory, but an ineffective practice.  When
> a PG backfills to a new OSD it builds the directory structure according to
> the current settings of how deep the folder structure should be.  So if you
> lose a drive or add storage, all of the PGs that move are no longer
> pre-split to where you think they are.  We have seen multiple times where
> PGs are different depths on different OSDs.  It is not a PG state as to how
> deep it's folder structure is, but a local state per copy of the PG on each
> OSD.
>
>
> Ultimately we're looking to Bluestore to be our Knight in Shining Armor to
> come and save us from all of this, but in the meantime, I have a couple
> ideas for how to keep our clusters usable.
>
> We add storage regularly without our cluster being completely unusable.  I
> took that idea and am testing this with some OSDs to weight the OSDs to 0,
> backfill all of the data off, restart them with new split/merge thresholds,
> and backfill data back onto them.  This would build the PG's on the OSDs
> with the current settings and get us away from the 12,800 objects setting
> we're stuck at now.  The next round will weight the next set of drives to 0
> while we start to backfill onto the previous drives with the new settings.
>  I have some very efficient weighting techniques that keep the cluster
> balanced while doing this, but it did take 2 days to finish backfilling off
> of the 32 drives.  Cluster performance was fairly poor during this and I
> can only do 3 out of our 30 nodes at a time which is a long time of
> running in a degraded state.
>
> The modification to the ceph-objectstore-tool in 10.2.4 and 0.94.10 looks
> very promising to help us manage this.  Doing the splits offline would work
> out quite well for us.  We're testing our QA environment with 10.2.3 and
> are putting some of that testing on hold until 10.2.4 is fixed.
>
> --
>
> David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
> 

Re: [ceph-users] 2x replication: A BIG warning

2016-12-09 Thread Oliver Humpage

> On 7 Dec 2016, at 15:01, Wido den Hollander  wrote:
> 
> I would always run with min_size = 2 and manually switch to min_size = 1 if 
> the situation really requires it at that moment.

Thanks for this thread, it’s been really useful.

I might have misunderstood, but does min_size=2 also mean that writes have to 
wait for at least 2 OSDs to have data written before the write is confirmed? I 
always assumed this would have a noticeable effect on performance and so left 
it at 1.

Our use case is RBDs being exported as iSCSI for ESXi. OSDs are journalled on 
enterprise SSDs, servers are linked with 10Gb, and we’re generally getting very 
acceptable speeds. Any idea as to how upping min_size to 2 might affect things, 
or should we just try it and see?

Oliver.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Server crashes on high mount volume

2016-12-09 Thread Diego Castro
Hello, my case is very specific but I think others may have this issue.

I have a ceph cluster up and running hosting block storage for my openshift
(kubernetes) cluster.
Things go bad when I "evacuate" a node, i.e. move all containers to other
hosts. When this happens I can see a lot of map/mount commands and suddenly
the node crashes; here is the log [1].


1.https://gist.github.com/spinolacastro/ff2bb85b3768a71d3ff6d1d6d85f00a2

[root@n-13-0 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

[root@n-13-0 ~]# uname -a
Linux n-13-0 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux

[root@n-13-0 ~]# rpm -qa | grep ceph
python-cephfs-10.2.4-0.el7.x86_64
ceph-common-10.2.4-0.el7.x86_64
libcephfs1-10.2.4-0.el7.x86_64

Both osd and clients runs Jewel.


---
Diego Castro / The CloudFather
GetupCloud.com - Eliminamos a Gravidade
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High load on OSD processes

2016-12-09 Thread lewis.geo...@innoscale.net
Hello,
 I am testing out a new node setup for us and I have configured a node in a 
single node cluster. It has 24 OSDs. Everything looked okay during the 
initial build and I was able to run the 'rados bench' on it just fine. 
However, if I just let the cluster sit and run for a few minutes without 
anything happening, the load starts to go up quickly. Each OSD device ends 
up using 130% CPU, with the load on the box hitting 550.00. No operations 
are going on, nothing shows up in the logs as happening or wrong. If I 
restart the OSD processes, the load stays down for a few minutes(almost at 
nothing) and then just jumps back up again.
  
 Any idea what could cause this or a direction I can look to check it?
  
 Have a good day,
  
 Lewis George
  
  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance measurements CephFS vs. RBD

2016-12-09 Thread Gregory Farnum
On Fri, Dec 9, 2016 at 6:58 AM, plataleas  wrote:
> Hi all
>
> We enabled CephFS on our Ceph Cluster consisting of:
> - 3 Monitor servers
> - 2 Metadata servers
> - 24 OSD  (3 OSD / Server)
> - Spinning disks, OSD Journal is on SSD
> - Public and Cluster Network separated, all 1GB
> - Release: Jewel 10.2.3
>
> With CephFS we reach roughly 1/3 of the write performance of RBD. There are
> some other discussions about RBD outperforming CephFS on the mailing list.
> However it would be interesting to have more figures about that topic.
>
> Writes on CephFS:
>
> # dd if=/dev/zero of=/data_cephfs/testfile.dd bs=50M count=1 oflag=direct
> 1+0 records in
> 1+0 records out
> 52428800 bytes (52 MB) copied, 1.40136 s, 37.4 MB/s
>
> #dd if=/dev/zero of=/data_cephfs/testfile.dd bs=500M count=1 oflag=direct
> 1+0 records in
> 1+0 records out
> 524288000 bytes (524 MB) copied, 13.9494 s, 37.6 MB/s
>
> # dd if=/dev/zero of=/data_cephfs/testfile.dd bs=1000M count=1 oflag=direct
> 1+0 records in
> 1+0 records out
> 1048576000 bytes (1.0 GB) copied, 27.7233 s, 37.8 MB/s
>
> Writes on RBD
>
> # dd if=/dev/zero of=/data_rbd/testfile.dd bs=50M count=1 oflag=direct
> 1+0 records in
> 1+0 records out
> 52428800 bytes (52 MB) copied, 0.558617 s, 93.9 MB/s
>
> # dd if=/dev/zero of=/data_rbd/testfile.dd bs=500M count=1 oflag=direct
> 1+0 records in
> 1+0 records out
> 524288000 bytes (524 MB) copied, 3.70657 s, 141 MB/s
>
> # dd if=/dev/zero of=/data_rbd/testfile.dd bs=1000M count=1 oflag=direct
> 1+0 records in
> 1+0 records out
> 1048576000 bytes (1.0 GB) copied, 7.75926 s, 135 MB/s
>
> Are these measurements reproducible by others ? Thanks for sharing your
> experience!

IIRC, the interfaces in use mean these are doing very different things
despite the flag similarity. Direct IO on rbd is still making use of
the RBD cache, but in CephFS it is going straight to the OSD (if
you're using the kernel client; if you're on ceph-fuse the flags might
get dropped on the kernel/FUSE barrier).
-Greg
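
To take the caching differences out of the picture it may be worth repeating
the test with an I/O engine that bypasses the page cache on both mounts; a
minimal sketch, assuming fio is installed and /data_cephfs and /data_rbd are
the two mount points:

fio --name=seqwrite --filename=/data_cephfs/fio.tmp --size=1G --rw=write \
    --bs=4M --direct=1 --ioengine=libaio --iodepth=16
fio --name=seqwrite --filename=/data_rbd/fio.tmp --size=1G --rw=write \
    --bs=4M --direct=1 --ioengine=libaio --iodepth=16
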
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread David Turner
"all OSDs are running jewel or later but the 'require_jewel_osds' osdmap flag 
is not set"

It's noted in the release notes that this will happen and that you then just 
set the flag and it goes away.
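
A minimal sketch, to be run only once every OSD really is on jewel or newer:

ceph osd set require_jewel_osds
ceph -s    # the HEALTH_WARN about the flag should clear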



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Andrey Shevel 
[shevel.and...@gmail.com]
Sent: Friday, December 09, 2016 1:33 PM
To: ceph-users
Subject: Re: [ceph-users] 10.2.4 Jewel released

I did
yum update

and found out that ceph has version 10.2.4 and also after update I
have the message

"all OSDs are running jewel or later but the 'require_jewel_osds'
osdmap flag is not set"

===
[ceph@ceph-swift-gateway ~]$ ceph -v
ceph version 10.2.4 (9411351cc8ce9ee03fbd46225102fe3d28ddf611)

[ceph@ceph-swift-gateway ~]$ cat /etc/yum.repos.d/ceph.repo
[Ceph]
name=Ceph packages for $basearch
baseurl=http://download.ceph.com/rpm-jewel/el7/$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=1

[Ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-jewel/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=1

[ceph-source]
name=Ceph source packages
baseurl=http://download.ceph.com/rpm-jewel/el7/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=1


[ceph@ceph-swift-gateway ~]$ ceph -s
    cluster 65b8080e-d813-45ca-9cc1-ecb242967694
     health HEALTH_WARN
            all OSDs are running jewel or later but the
            'require_jewel_osds' osdmap flag is not set
     monmap e22: 4 mons at
            {osd2=10.10.1.12:6789/0,osd3=10.10.1.13:6789/0,osd4=10.10.1.14:6789/0,stor=10.10.1.41:6789/0}
            election epoch 317068, quorum 0,1,2,3 osd2,osd3,osd4,stor
      fsmap e1854002: 1/1/1 up {0=osd2=up:active}, 2 up:standby
     osdmap e1889715: 22 osds: 22 up, 22 in
      pgmap v6294339: 3472 pgs, 28 pools, 10805 MB data, 3218 objects
            10890 MB used, 81891 GB / 81902 GB avail
                3472 active+clean


Is ceph 10.2.4-1 still in the test stage?



On Fri, Dec 9, 2016 at 9:30 PM, Udo Lembke  wrote:
> Hi,
>
> unfortunately there are no Debian Jessie packages...
>
>
> Don't know that an recompile take such an long time for ceph... I think
> such an important fix should hit the repros faster.
>
>
> Udo
>
>
> On 09.12.2016 18:54, Francois Lafont wrote:
>> On 12/09/2016 06:39 PM, Alex Evonosky wrote:
>>
>>> Sounds great.  May I asked what procedure you did to upgrade?
>> Of course. ;)
>>
>> It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
>> (I think this link was pointed by Greg Farnum or Sage Weil in a
>> previous message).
>>
>> Personally I use Ubuntu Trusty, so for me in the page above leads me
>> to use this line in my "sources.list":
>>
>> deb http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/ trusty main
>>
>> And after that "apt-get update && apt-get upgrade" etc.



--
Andrey Y Shevel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Andrey Shevel
I did
yum update

and found out that ceph has version 10.2.4 and also after update I
have the message

"all OSDs are running jewel or later but the 'require_jewel_osds'
osdmap flag is not set"

===
[ceph@ceph-swift-gateway ~]$ ceph -v
ceph version 10.2.4 (9411351cc8ce9ee03fbd46225102fe3d28ddf611)

[ceph@ceph-swift-gateway ~]$ cat /etc/yum.repos.d/ceph.repo
[Ceph]
name=Ceph packages for $basearch
baseurl=http://download.ceph.com/rpm-jewel/el7/$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=1

[Ceph-noarch]
name=Ceph noarch packages
baseurl=http://download.ceph.com/rpm-jewel/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=1

[ceph-source]
name=Ceph source packages
baseurl=http://download.ceph.com/rpm-jewel/el7/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=1


[ceph@ceph-swift-gateway ~]$ ceph -s
cluster 65b8080e-d813-45ca-9cc1-ecb242967694
 health HEALTH_WARN
all OSDs are running jewel or later but the
'require_jewel_osds' osdmap flag is not set
 monmap e22: 4 mons at
{osd2=10.10.1.12:6789/0,osd3=10.10.1.13:6789/0,osd4=10.10.1.14:6789/0,stor=10.10.1.41:6789/0}
election epoch 317068, quorum 0,1,2,3 osd2,osd3,osd4,stor
  fsmap e1854002: 1/1/1 up {0=osd2=up:active}, 2 up:standby
 osdmap e1889715: 22 osds: 22 up, 22 in
  pgmap v6294339: 3472 pgs, 28 pools, 10805 MB data, 3218 objects
10890 MB used, 81891 GB / 81902 GB avail
3472 active+clean


Is ceph 10.2.4-1 still in the test stage?



On Fri, Dec 9, 2016 at 9:30 PM, Udo Lembke  wrote:
> Hi,
>
> unfortunately there are no Debian Jessie packages...
>
>
> Don't know that an recompile take such an long time for ceph... I think
> such an important fix should hit the repros faster.
>
>
> Udo
>
>
> On 09.12.2016 18:54, Francois Lafont wrote:
>> On 12/09/2016 06:39 PM, Alex Evonosky wrote:
>>
>>> Sounds great.  May I asked what procedure you did to upgrade?
>> Of course. ;)
>>
>> It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
>> (I think this link was pointed by Greg Farnum or Sage Weil in a
>> previous message).
>>
>> Personally I use Ubuntu Trusty, so for me in the page above leads me
>> to use this line in my "sources.list":
>>
>> deb http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/ trusty main
>>
>> And after that "apt-get update && apt-get upgrade" etc.



-- 
Andrey Y Shevel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kraken 11.x feedback

2016-12-09 Thread Ben Hines
Not particularly, I just never did the Jewel upgrade (I normally like to
stay relatively current).

-Ben

On Fri, Dec 9, 2016 at 11:40 AM, Samuel Just  wrote:

> Is there a particular reason you are sticking to the versions with
> shorter support periods?
> -Sam
>
> On Fri, Dec 9, 2016 at 11:38 AM, Ben Hines  wrote:
> > Anyone have any good / bad experiences with Kraken? I haven't seen much
> > discussion of it. Particularly from the RGW front.
> >
> > I'm still on Infernalis for our cluster, considering going up to K.
> >
> > thanks,
> >
> > -Ben
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kraken 11.x feedback

2016-12-09 Thread Samuel Just
Is there a particular reason you are sticking to the versions with
shorter support periods?
-Sam

On Fri, Dec 9, 2016 at 11:38 AM, Ben Hines  wrote:
> Anyone have any good / bad experiences with Kraken? I haven't seen much
> discussion of it. Particularly from the RGW front.
>
> I'm still on Infernalis for our cluster, considering going up to K.
>
> thanks,
>
> -Ben
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kraken 11.x feedback

2016-12-09 Thread Ben Hines
Anyone have any good / bad experiences with Kraken? I haven't seen much
discussion of it. Particularly from the RGW front.

I'm still on Infernalis for our cluster, considering going up to K.

thanks,

-Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problems with multipart RGW uploads.

2016-12-09 Thread Martin Bureau
Hello,


I am looking for help with a problem we have with our Jewel (10.2.4-1-g5d3c76c) 
cluster. Some files (which show up in the bucket listing) cannot be downloaded 
and return HTTP 404 and "ERROR: got unexpected error when trying to read 
object: -2" in the rgw log.


Regards,

Martin




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS FAILED assert(dn->get_linkage()->is_null())

2016-12-09 Thread Goncalo Borges
Hi Sean, Rob.

I saw on the tracker that you were able to resolve the mds assert by manually 
cleaning the corrupted metadata. Since I am also hitting that issue and I 
suspect that I will face an mds assert of the same type sooner or later, can 
you please explain a bit further what operations you did to clean up the 
problem?
Cheers
Goncalo

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Rob Pickerill 
[r.picker...@gmail.com]
Sent: 09 December 2016 07:13
To: Sean Redmond; John Spray
Cc: ceph-users
Subject: Re: [ceph-users] CephFS FAILED assert(dn->get_linkage()->is_null())

Hi John / All

Thank you for the help so far.

To add a further point to Sean's previous email, I see this log entry before 
the assertion failure:

-6> 2016-12-08 15:47:08.483700 7fb133dca700 12 mds.0.cache.dir(1000a453344) 
remove_dentry [dentry #100/stray9/1000a453344/config [2,head] auth NULL 
(dversion lock) v=540 inode=0 0x55e8664fede0]
-5> 2016-12-08 15:47:08.484882 7fb133dca700 -1 mds/CDir.cc: In function 
'void CDir::try_remove_dentries_for_stray()' thread 7fb133dca700 time 
2016-12-08 15:47:08.483704
mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())

And I can reference this with:

root@ceph-mon1:~/1000a453344# rados -p ven-ceph-metadata-1 listomapkeys 
1000a453344.
1470734502_head
config_head

Would we also need to clean up this object, and if so, is there a safe way we 
can do this?

Rob

On Thu, 8 Dec 2016 at 19:58 Sean Redmond 
> wrote:
Hi John,

Thanks for your pointers, I have extracted the omap keys and omap values for 
an object I found in the metadata pool called '600.' and dropped them 
at the below location

https://www.dropbox.com/sh/wg6irrjg7kie95p/AABk38IB4PXsn2yINpNa9Js5a?dl=0

Could you explain how it is possible to identify stray directory fragments?

Thanks

On Thu, Dec 8, 2016 at 6:30 PM, John Spray 
> wrote:
On Thu, Dec 8, 2016 at 3:45 PM, Sean Redmond 
> wrote:
> Hi,
>
> We had no changes going on with the ceph pools or ceph servers at the time.
>
> We have however been hitting this in the last week and it maybe related:
>
> http://tracker.ceph.com/issues/17177

Oh, okay -- so you've got corruption in your metadata pool as a result
of hitting that issue, presumably.

I think in the past people have managed to get past this by taking
their MDSs offline and manually removing the omap entries in their
stray directory fragments (i.e. using the `rados` cli on the objects
starting "600.").

John



> Thanks
>
> On Thu, Dec 8, 2016 at 3:34 PM, John Spray 
> > wrote:
>>
>> On Thu, Dec 8, 2016 at 3:11 PM, Sean Redmond 
>> >
>> wrote:
>> > Hi,
>> >
>> > I have a CephFS cluster that is currently unable to start the mds server
>> > as
>> > it is hitting an assert, the extract from the mds log is below, any
>> > pointers
>> > are welcome:
>> >
>> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> >
>> > 2016-12-08 14:50:18.577038 7f7d9faa3700  1 mds.0.47077 handle_mds_map
>> > state
>> > change up:rejoin --> up:active
>> > 2016-12-08 14:50:18.577048 7f7d9faa3700  1 mds.0.47077 recovery_done --
>> > successful recovery!
>> > 2016-12-08 14:50:18.577166 7f7d9faa3700  1 mds.0.47077 active_start
>> > 2016-12-08 14:50:19.460208 7f7d9faa3700  1 mds.0.47077 cluster
>> > recovered.
>> > 2016-12-08 14:50:19.495685 7f7d9abfc700 -1 mds/CDir.cc: In function
>> > 'void
>> > CDir::try_remove_dentries_for_stray()' thread 7f7d9abfc700 time
>> > 2016-12-08 14:50:19.494508
>> > mds/CDir.cc: 699: FAILED assert(dn->get_linkage()->is_null())
>> >
>> >  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> > const*)+0x80) [0x55f0f789def0]
>> >  2: (CDir::try_remove_dentries_for_stray()+0x1a0) [0x55f0f7c0]
>> >  3: (StrayManager::__eval_stray(CDentry*, bool)+0x8c9) [0x55f0f75e7799]
>> >  4: (StrayManager::eval_stray(CDentry*, bool)+0x22) [0x55f0f75e7cf2]
>> >  5: (MDCache::scan_stray_dir(dirfrag_t)+0x16d) [0x55f0f753b30d]
>> >  6: (MDSInternalContextBase::complete(int)+0x18b) [0x55f0f76e93db]
>> >  7: (MDSRank::_advance_queues()+0x6a7) [0x55f0f749bf27]
>> >  8: (MDSRank::ProgressThread::entry()+0x4a) [0x55f0f749c45a]
>> >  9: (()+0x770a) [0x7f7da6bdc70a]
>> >  10: (clone()+0x6d) [0x7f7da509d82d]
>> >  NOTE: a copy of the executable, or `objdump -rdS ` is
>> > needed to
>> > interpret this.
>>
>> Last time someone had this issue they had tried to create a filesystem
>> using pools that had another filesystem's old objects in:
>> http://tracker.ceph.com/issues/16829
>>
>> What was going on on your system before you hit this?
>>
>> John
>>
>> > Thanks
>> >
>> > 

Re: [ceph-users] 2x replication: A BIG warning

2016-12-09 Thread Kees Meijs
Hi Wido,

Since it's a Friday night, I decided to just go for it. ;-)

It took a while to rebalance the cache tier but all went well. Thanks
again for your valuable advice!

Best regards, enjoy your weekend,
Kees

On 07-12-16 14:58, Wido den Hollander wrote:
>> Anyway, any things to consider or could we just:
>>
>>  1. Run "ceph osd pool set cache size 3".
>>  2. Wait for rebalancing to complete.
>>  3. Run "ceph osd pool set cache min_size 2".
>>
> Indeed! It is a simple as that.
>
> Your cache pool can also contain very valuable data you do not want to loose.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Udo Lembke
Hi,

unfortunately there are no Debian Jessie packages...


I don't know why a recompile takes such a long time for ceph... I think
such an important fix should hit the repos faster.


Udo


On 09.12.2016 18:54, Francois Lafont wrote:
> On 12/09/2016 06:39 PM, Alex Evonosky wrote:
>
>> Sounds great.  May I asked what procedure you did to upgrade?
> Of course. ;)
>
> It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
> (I think this link was pointed by Greg Farnum or Sage Weil in a
> previous message).
>
> Personally I use Ubuntu Trusty, so for me in the page above leads me
> to use this line in my "sources.list":
>
> deb http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/ trusty main
>
> And after that "apt-get update && apt-get upgrade" etc.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore_split_multiple hardcoded maximum?

2016-12-09 Thread David Turner
Our 32k PGs each have about 25-30k objects (25-30GB per PG).  When we first 
contracted with Redhat support, they recommended for us to have our setting at 
about 4000 files per directory before splitting into subfolders.  When we split 
into subfolders with that setting, an osd_heartbeat_grace (how long before an 
OSD can't be reached before reporting it down to the MONs) of 60 was needed to 
not flap OSDs during subfolder splitting.

With the plan to go back and lower the setting again, we would increase that 
setting to make it through a holiday weekend or a time where we needed to have 
higher performance.  When we went to lower it, it was too painful to get 
through and now we're at what looks like a hardcoded maximum of 12,800 objects 
per subfolder before a split is forced.  At the amount of objects now, we have 
to use an osd_heartbeat_grace of 240 to avoid flapping OSDs during subfolder 
splitting.

Unless you NEED to merge your subfolders, you can set your filestore merge 
threshold to a negative number and it will never merge.  The equation for 
knowing when to split further takes the absolute value of the merge threshold 
so you can just invert it to a negative number and not change the behavior of 
splitting while disabling merging.
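
As a concrete sketch, the settings being discussed live in the [osd] section of
ceph.conf; the values below are illustrative, not a recommendation. The split
point works out to 16 * filestore_split_multiple * |filestore_merge_threshold|
objects per subfolder, and new values only apply to directories created or
backfilled afterwards:

[osd]
filestore_merge_threshold = -40   # negative: merging disabled
filestore_split_multiple = 8      # split at 16 * 8 * 40 = 5120 objects per subfolder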

The OSDs flapping is unrelated to the 10.2.3 bug.  We're currently on 0.94.7 
and have had this problem since Firefly.  The flapping is due to the OSD being 
so involved in the process to split the subfolder that it isn't responding to 
other requests, that's why using osd_heartbeat_grace gets us through the 
splitting.
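
A minimal sketch of bumping the grace at runtime; it should also be persisted
in ceph.conf, since the monitors consult the same value when deciding whether
to mark an OSD down:

ceph tell osd.* injectargs '--osd_heartbeat_grace 240'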

1) We do not have SELinux installed on our Ubuntu servers.

2) We monitor and manage our fragmentation and haven't seen much of an issue 
since we increased our alloc_size in the mount options for XFS.

"5) pre-splitting PGs is I think the right answer."  Pre-splitting PGs is 
counter-intuitive.  It's a good theory, but an ineffective practice.  When a PG 
backfills to a new OSD it builds the directory structure according to the 
current settings of how deep the folder structure should be.  So if you lose a 
drive or add storage, all of the PGs that move are no longer pre-split to where 
you think they are.  We have seen multiple times where PGs are different depths 
on different OSDs.  It is not a PG state as to how deep its folder structure 
is, but a local state per copy of the PG on each OSD.


Ultimately we're looking to Bluestore to be our Knight in Shining Armor to come 
and save us from all of this, but in the meantime, I have a couple ideas for 
how to keep our clusters usable.

We add storage regularly without our cluster being completely unusable.  I took 
that idea and am testing this with some OSDs to weight the OSDs to 0, backfill 
all of the data off, restart them with new split/merge thresholds, and backfill 
data back onto them.  This would build the PG's on the OSDs with the current 
settings and get us away from the 12,800 objects setting we're stuck at now.  
The next round will weight the next set of drives to 0 while we start to 
backfill onto the previous drives with the new settings.  I have some very 
efficient weighting techniques that keep the cluster balanced while doing this, 
but it did take 2 days to finish backfilling off of the 32 drives.  Cluster 
performance was fairly poor during this and I can only do 3 out of our 30 nodes 
at a time which is a long time of running in a degraded state.

The modification to the ceph-objectstore-tool in 10.2.4 and 0.94.10 looks very 
promising to help us manage this.  Doing the splits offline would work out 
quite well for us.  We're testing our QA environment with 10.2.3 and are 
putting some of that testing on hold until 10.2.4 is fixed.
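
For reference, a minimal sketch of the offline split as we understand it; the
op name is from memory of the 10.2.4 ceph-objectstore-tool and is worth
checking against --help on your build. OSD id 12 and pool "data" are
placeholders, and the OSD must be stopped first:

systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --journal-path /var/lib/ceph/osd/ceph-12/journal \
    --op apply-layout-settings --pool data
systemctl start ceph-osd@12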




David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Mark Nelson 
[mnel...@redhat.com]
Sent: Thursday, December 08, 2016 10:25 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] filestore_split_multiple hardcoded maximum?

I don't want to retype it all, but you guys might be interested in the
discussion under section 3 of this post here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html

basically the gist of it is:

1) Make sure SELinux isn't doing security xattr lookups for link/unlink
operations (this makes splitting incredibly painful!).  You 

[ceph-users] Pgs stuck on undersized+degraded+peered

2016-12-09 Thread fridifree
Hi,
1 of 3 of my osd servers is down and I get this error
And I do not have any access to rbds on the cluster

Any suggestions?

Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Alex Evonosky
Thank you sir.  Ubuntu here as well.





On Fri, Dec 9, 2016 at 12:54 PM, Francois Lafont <
francois.lafont.1...@gmail.com> wrote:

> On 12/09/2016 06:39 PM, Alex Evonosky wrote:
>
> > Sounds great.  May I asked what procedure you did to upgrade?
>
> Of course. ;)
>
> It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
> (I think this link was pointed by Greg Farnum or Sage Weil in a
> previous message).
>
> Personally I use Ubuntu Trusty, so for me in the page above leads me
> to use this line in my "sources.list":
>
> deb http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/ trusty main
>
> And after that "apt-get update && apt-get upgrade" etc.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Francois Lafont
On 12/09/2016 06:39 PM, Alex Evonosky wrote:

> Sounds great.  May I asked what procedure you did to upgrade?

Of course. ;)

It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
(I think this link was pointed by Greg Farnum or Sage Weil in a
previous message).

Personally I use Ubuntu Trusty, so for me the page above leads me
to use this line in my "sources.list":

deb http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/ trusty main

And after that "apt-get update && apt-get upgrade" etc.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Alex Evonosky
Francois-

Sounds great.  May I asked what procedure you did to upgrade?

Thank you!




On Fri, Dec 9, 2016 at 12:20 PM, Francois Lafont <
francois.lafont.1...@gmail.com> wrote:

> Hi,
>
> Just for information, after the upgrade to the version
> 10.2.4-1-g5d3c76c (5d3c76c1c6e991649f0beedb80e6823606176d9e)
> of all my cluster (osd, mon and mds) since ~30 hours, I have
> no problem (my cluster is a small cluster with 5 nodes and
> 4 osds per nodes and 3 monitors and I just use cephfs).
>
> Bye.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Francois Lafont
Hi,

Just for information, after the upgrade to the version
10.2.4-1-g5d3c76c (5d3c76c1c6e991649f0beedb80e6823606176d9e)
of all my cluster (osd, mon and mds) since ~30 hours, I have
no problem (my cluster is a small cluster with 5 nodes and
4 osds per nodes and 3 monitors and I just use cephfs).

Bye.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.4 Jewel released

2016-12-09 Thread Graham Allan
On Thu, Dec 8, 2016 at 5:19 AM, Francois Lafont <
francois.lafont.1...@gmail.com> wrote:

> On 12/08/2016 11:24 AM, Ruben Kerkhof wrote:
>
> > I've been running this on one of my servers now for half an hour, and
> > it fixes the issue.
>
> It's the same for me. ;)
>
> ~$ ceph -v
> ceph version 10.2.4-1-g5d3c76c (5d3c76c1c6e991649f0beedb80e6823606176d9e)
>

In our case I applied the above version only to our rados gateways, and
this resolved the problem well enough. There is still high load on the OSD
nodes but they seem responsive and otherwise stable. Before patching the
gateway nodes, radosgw would eventually run away to a loadavg of 3000+ and
stop responding.

Should we expect a more general release of a revised 10.2.4? If it's a
matter of a few days or a week, then I'm inclined to just sit tight and
wait.

Thanks for the rapid fix!

Graham
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs down after reboot

2016-12-09 Thread John Petrini
Try using systemctl start ceph-osd*

I usually refer to this documentation for ceph + systemd
https://www.suse.com/documentation/ses-1/book_storage_admin/data/ceph_operating_services.html
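
A minimal sketch, assuming OSD id 1 and a ceph-disk/udev provisioned data
partition (the mount normally comes from udev activation rather than fstab):

systemctl status ceph-osd@1 -l    # see why the unit failed
systemctl start ceph-osd@1
systemctl enable ceph-osd@1       # start on boot
ceph-disk activate-all            # re-trigger mount + start for all prepared OSD partitions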

___

John Petrini

NOC Systems Administrator   //   *CoreDial, LLC*   //   coredial.com

Hillcrest I, 751 Arbor Way, Suite 150, Blue Bell PA, 19422
*P: *215.297.4400 x232   //   *F: *215.297.4401   //   *E: * jpetr...@coredial.com


The information transmitted is intended only for the person or entity to
which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission,  dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipient is prohibited. If you received
this in error, please contact the sender and delete the material from any
computer.

On Fri, Dec 9, 2016 at 11:13 AM, sandeep.cool...@gmail.com <
sandeep.cool...@gmail.com> wrote:

> Hi,
>
> Im using jewel (10.2.4)  release on centos 7.2, after rebooting one of the
> OSD node, the osd doesn't start. Even after trying the 'systemctl start
> ceph-osd@.service'.
> Does we have to make entry for in fstab for our ceph osd's folder or ceph
> does it automatically?
>
> Then i mounted the correct partition on my disks and tried  'systemctl
> start ceph-osd@.service' , but still osd doesn't comes up.
>
> But when i try with 'ceph-osd -i 1' , the OSD comes UP.
>
> tried searching online but couldn't find anything concrete on this
> problem. Is systemd scripts has some bugs, or im missing something here??
>
> Regards,
> Sandeep
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs down after reboot

2016-12-09 Thread sandeep.cool...@gmail.com
Hi,

I'm using the jewel (10.2.4) release on CentOS 7.2. After rebooting one of the
OSD nodes, the OSDs don't start, even after trying 'systemctl start
ceph-osd@.service'.
Do we have to make an entry in fstab for our ceph OSD folders, or does ceph
do it automatically?

Then I mounted the correct partition on my disks and tried 'systemctl
start ceph-osd@.service', but the OSD still doesn't come up.

But when I try with 'ceph-osd -i 1', the OSD comes UP.

I tried searching online but couldn't find anything concrete on this problem.
Do the systemd scripts have some bugs, or am I missing something here?

Regards,
Sandeep
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance measurements CephFS vs. RBD

2016-12-09 Thread plataleas
Hi all

We enabled CephFS on our Ceph Cluster consisting of:
- 3 Monitor servers
- 2 Metadata servers
- 24 OSD  (3 OSD / Server)
- Spinning disks, OSD Journal is on SSD
- Public and Cluster Network separated, all 1GB
- Release: Jewel 10.2.3

With CephFS we reach roughly 1/3 of the write performance of RBD. There
are some other discussions about RBD outperforming CephFS on the mailing
list. However it would be interesting to have more figures about that
topic.

*Writes on CephFS*:

# dd if=/dev/zero of=/data_cephfs/testfile.dd bs=50M count=1 oflag=direct
1+0 records in
1+0 records out
52428800 bytes (52 MB) copied, 1.40136 s, *37.4 MB/s*

#dd if=/dev/zero of=/data_cephfs/testfile.dd bs=500M count=1 oflag=direct
1+0 records in
1+0 records out
524288000 bytes (524 MB) copied, 13.9494 s, *37.6 MB/s*

# dd if=/dev/zero of=/data_cephfs/testfile.dd bs=1000M count=1 oflag=direct
1+0 records in
1+0 records out
1048576000 bytes (1.0 GB) copied, 27.7233 s, *37.8 MB/s*

*Writes on RBD*

# dd if=/dev/zero of=/data_rbd/testfile.dd bs=50M count=1 oflag=direct
1+0 records in
1+0 records out
52428800 bytes (52 MB) copied, 0.558617 s, *93.9 MB/s*

# dd if=/dev/zero of=/data_rbd/testfile.dd bs=500M count=1 oflag=direct
1+0 records in
1+0 records out
524288000 bytes (524 MB) copied, 3.70657 s, *141 MB/s*

# dd if=/dev/zero of=/data_rbd/testfile.dd bs=1000M count=1 oflag=direct
1+0 records in
1+0 records out
1048576000 bytes (1.0 GB) copied, 7.75926 s, *135 MB/s*

Are these measurements reproducible by others ? Thanks for sharing your
experience!

regards
martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] problem after reinstalling system

2016-12-09 Thread Dan van der Ster
On Thu, Dec 8, 2016 at 5:51 PM, Jake Young  wrote:
> Hey Dan,
>
> I had the same issue that Jacek had after changing my OS  and Ceph version
> from Ubuntu 14 - Hammer to Centos 7 - Jewel. I was also able to recover from
> the failure by renaming the .ldb files to .sst files.
>
> Do you know why this works?
>
> Is it just because leveldb changed the file naming standard and it isn't
> backwards compatible with the older version on Centos?

It appears so.

-- dan
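
For the archives, a minimal sketch of the rename, assuming the OSD is stopped
and its omap leveldb lives in /var/lib/ceph/osd/ceph-0/current/omap (copy the
directory somewhere safe first):

cd /var/lib/ceph/osd/ceph-0/current/omap
for f in *.ldb; do mv -- "$f" "${f%.ldb}.sst"; done
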
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd showmapped -p and --image options missing in rbd version 10.2.4, why?

2016-12-09 Thread Ilya Dryomov
On Fri, Dec 9, 2016 at 10:52 AM, Stéphane Klein
 wrote:
> Hi,
>
> with: rbd version 0.80.7, `rbd showmapped` have this options:
>
> *   -p, --pool    source pool name
> *   --image       image name
> 
> These options are missing in rbd version 10.2.4
>
> Why ? It is a regression ? Is there another command to list map by pool name
> and image name ?

rbd CLI tool was rewritten in infernalis, making argument handling
more strict.  While rbd showmapped might have accepted -p and --image
in 0.80.7, I don't think it ever did any filtering - they were simply
ignored.  If you do "rbd showmapped" without any options, you should
get the same result in 10.2.4.
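
If you need the old filtering behaviour, the columns of "rbd showmapped"
(id, pool, image, snap, device) are easy to filter in the shell; a minimal
sketch with "rbd" and "myimage" as placeholders:

rbd showmapped | awk 'NR==1 || $2 == "rbd"'        # mappings from one pool
rbd showmapped | awk 'NR==1 || $3 == "myimage"'    # a single image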

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd showmapped -p and --image options missing in rbd version 10.2.4, why?

2016-12-09 Thread Stéphane Klein
Hi,

with: rbd version 0.80.7, `rbd showmapped` have this options:

*   -p, --pool    source pool name
*   --image       image name

These options are missing in rbd version 10.2.4.

Why? Is it a regression? Is there another command to list mappings by pool
name and image name?

Best regards,
Stéphane

-- 
Stéphane Klein 
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com