[ceph-users] [rgw][hammer] quota. how it should work?

2016-02-04 Thread Odintsov Vladislav
Hi all,

I'm trying to set up bucket quotas in Hammer. Following the Ceph documentation and using the 
S3 API, I don't get any errors when I exceed the limit (total count of 
objects in a bucket, or total bucket size), so I've got some questions:

There are two places where I can configure quotas: on the user, and on a bucket of this 
user. 

User:

# radosgw-admin user info --uid=6e4cc9e4-2262-4dc9-b3b5-f94c7878991c
{
"user_id": "6e4cc9e4-2262-4dc9-b3b5-f94c7878991c",
"display_name": "u...@cloud.croc.ru",
"email": "",
"suspended": 0,
"max_buckets": 10,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "6e4cc9e4-2262-4dc9-b3b5-f94c7878991c",
"access_key": "user:u...@cloud.croc.ru",
"secret_key": "key"
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": true,
"max_size_kb": 2,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"temp_url_keys": []
}

# radosgw-admin bucket stats --bucket=eee
{
"bucket": "eee",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets.index",
"id": "default.4554.5",
"marker": "default.4554.5",
"owner": "6e4cc9e4-2262-4dc9-b3b5-f94c7878991c",
"ver": "0#1",
"master_ver": "0#0",
"mtime": "2016-02-05 10:05:41.00",
"max_marker": "0#",
"usage": {},
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}


Well, in the user object bucket_quota is enabled and limited to 2 KB, but in the bucket 
object bucket_quota is disabled. Which quota is respected?
I am not rejected if I put an object larger than the quota into a bucket of this user.

I POSTed a 5 KB object to this bucket without any problems:
# radosgw-admin bucket stats --bucket=eee
{
"bucket": "eee",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets.index",
"id": "default.4554.5",
"marker": "default.4554.5",
"owner": "6e4cc9e4-2262-4dc9-b3b5-f94c7878991c",
"ver": "0#3",
"master_ver": "0#0",
"mtime": "2016-02-05 10:05:41.00",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size_kb": 5,
"size_kb_actual": 8,
"num_objects": 1
}
},
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}

Now I try setting the quota on the bucket itself and empty the bucket:

# radosgw-admin quota set --bucket=eee --max-size=2048
# radosgw-admin quota enable --bucket=eee
# radosgw-admin bucket stats --bucket=eee
{
"bucket": "eee",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets.index",
"id": "default.4554.5",
"marker": "default.4554.5",
"owner": "6e4cc9e4-2262-4dc9-b3b5-f94c7878991c",
"ver": "0#5",
"master_ver": "0#0",
"mtime": "2016-02-05 10:13:02.00",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size_kb": 0,
"size_kb_actual": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": true,
"max_size_kb": 2,
"max_objects": -1
}
}

POSTing a 5 KB object again succeeds:

af27: ==> /var/log/httpd/rgw-ssl-access.log <==
af27: 172.20.33.121 - - [05/Feb/2016:10:18:39 +0300] "POST /eee HTTP/1.1" 204 - 
"https://console.c2.croc.ru/storage"; "Mozilla/5.0 (Macintosh; Intel Mac OS X 
10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 
Safari/537.36"

# radosgw-admin bucket stats --bucket=eee
{
"bucket": "eee",
"pool": ".rgw.buckets",
"index_pool": ".rgw.buckets.index",
"id": "default.4554.5",
"marker": "default.4554.5",
"owner": "6e4cc9e4-2262-4dc9-b3b5-f94c7878991c",
"ver": "0#7",
"master_ver": "0#0",
"mtime": "2016-02-05 10:13:02.00",
"max_marker": "0#",
"usage": {
"rgw.main": {
"size_kb": 5,
"size_kb_actual": 8,
"num_objects": 1
}
},
"bucket_quota": {
"enabled": true,
"max_size_kb": 2,
"max_objects": -1
}
}


What am I doing wrong? Or is there maybe a bug? I'm running Ceph 0.94.5 on RHEL 6.6.
There was a related bug, but it seems to be fixed:
http://tracker.ceph.com/issues/11727
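
For reference, I set the user-level bucket quota above with commands equivalent to these 
(if I followed the Hammer docs correctly). Note also that, as I understand it, RGW caches 
bucket/quota stats, so enforcement can lag by up to rgw bucket quota ttl (600 s by default, 
I believe):

# radosgw-admin quota set --quota-scope=bucket --uid=6e4cc9e4-2262-4dc9-b3b5-f94c7878991c --max-size=2048
# radosgw-admin quota enable --quota-scope=bucket --uid=6e4cc9e4-2262-4dc9-b3b5-f94c7878991c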

Thanks for any thoughts.

--- 
Regards,
 
Vladislav Odintsov
System Engineer of Croc Cloud Development Team
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Confusing message when (re)starting OSDs (location)

2016-02-04 Thread Christian Balzer

Hello,

This is the latest version of Hammer, when restarting an OSD I get this
output:
---
=== osd.23 === 
create-or-move updated item name 'osd.23' weight 0 at location 
{host=engtest03,root=default} to crush map
---

However that host and all OSDs on it reside under a different root and
thankfully still do after the restart despite that message.

---
# ceph osd tree
ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-8  8.0 root testroot
-6  8.0 host engtest03   
20  2.0 osd.20  up  1.0  1.0 
21  2.0 osd.21  up  1.0  1.0 
22  2.0 osd.22  up  1.0  1.0 
23  2.0 osd.23  up  1.0  1.0 
-1 32.0 root default 
-2  8.0 host irt04   
 3  2.0 osd.3   up  1.0  1.0 
. . .
---
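
For what it's worth, if one wants to keep the start script from computing a location at all
(and with it that log line), something like this in ceph.conf should do it. Option names as
of Hammer; adjust the location to the actual tree:

[osd]
    osd crush update on start = false
    # or pin the location explicitly instead:
    # osd crush location = root=testroot host=engtest03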


Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Set cache tier pool forward state automatically!

2016-02-04 Thread Christian Balzer
On Thu, 4 Feb 2016 21:33:10 -0700 Robert LeBlanc wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> On Thu, Feb 4, 2016 at 8:32 PM, Christian Balzer  wrote:
> > On Wed, 3 Feb 2016 22:42:32 -0700 Robert LeBlanc wrote:
> 
> > I just finished downgrading my test cluster from testing to Jessie and
> > then upgrading Ceph from Firefly to Hammer (that was fun few hours).
> >
> > And I can confirm that I don't see that issue with Hammer, wonder if
> > it's worth prodding the devs about.
> > I sorta dread the time the PG upgrade process will take when going to
> > Hammer on the overloaded production server.
> > But then again, a fix for Firefly is both unlikely and going to take
> > too long for my case anyway.
> 
> There isn't any PG upgrading as part of the Firefly to Hammer process
> that I can think of. If there is, it wasn't long. Setting noout would
> prevent backfilling so only recovery of delta changes would be needed.
> You don't have to reboot the node, just restart the process so it can
> limit the amount of changes to only a minute or two. It's been a long
> time since I did that upgrade that I just may be plain forgetting
> something.
> 
There is; it wasn't particularly long on my test cluster, but that one is
tiny compared to the production one and was also under no I/O load during
the restart:
---
2016-02-05 11:51:19.763477 7fa3654cb900 -1 filestore(/var/lib/ceph/osd/ceph-20) 
FileStore::mount : stale version stamp detected: 3. Proceeding, do_update is 
set, performing disk format upgrade.
2016-02-05 11:51:20.057585 7fa3654cb900  1 filestore upgrade_to_v2 start
2016-02-05 11:51:20.097837 7fa3654cb900  1 filestore upgrade_to_v2 done
2016-02-05 11:51:21.710174 7fa3654cb900 -1 osd.20 9953 PGs are upgrading
---

In total, that initial Hammer OSD startup took about twice as long as with
Firefly before and as consecutive Hammer restarts, though in this case that
means just 2-3 seconds more.

Guess there's no avoiding that bullet.

Christian

> I think you are right about the fix not going into Firefly since it
> will EoL this year. Glad to know that it is fixed in Hammer.
> 
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Set cache tier pool forward state automatically!

2016-02-04 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

On Thu, Feb 4, 2016 at 8:32 PM, Christian Balzer  wrote:
> On Wed, 3 Feb 2016 22:42:32 -0700 Robert LeBlanc wrote:

> I just finished downgrading my test cluster from testing to Jessie and
> then upgrading Ceph from Firefly to Hammer (that was fun few hours).
>
> And I can confirm that I don't see that issue with Hammer, wonder if it's
> worth prodding the devs about.
> I sorta dread the time the PG upgrade process will take when going to
> Hammer on the overloaded production server.
> But then again, a fix for Firefly is both unlikely and going to take too
> long for my case anyway.

There isn't any PG upgrading as part of the Firefly to Hammer process
that I can think of. If there is, it wasn't long. Setting noout would
prevent backfilling so only recovery of delta changes would be needed.
You don't have to reboot the node, just restart the process so it can
limit the amount of changes to only a minute or two. It's been a long
time since I did that upgrade that I just may be plain forgetting
something.

I think you are right about the fix not going into Firefly since it
will EoL this year. Glad to know that it is fixed in Hammer.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.4
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWtCYCCRDmVDuy+mK58QAAvDEP/i85/cjKQwi4idRzLT9e
7oecZ2kTldVNLILLsGhmbg+oABgyKQ7uNY+XTJXSlYMYIKGpoQ9cDO/r9tB3
nANDVxvVF6yxiA4Pzo8ybytu+qKyOeB17ri3//ReFyyPg+tDJsNpXV+ECUFX
LZPekvhV397JFS8KoT00nkzGGiWh1PlbQCYqZCNCbsrhIqwCjFq+k5ydKpvv
qJfTh1d3V0h0vgtbtdC4Vdrzvqr65BoLHNcy6cOlIzPHhkJi6W5rABB6Haec
sn7onFqsdJn9TSEJ8TSHfgtaWR5vT7y6/AQHHDafXzdr/VZKorwemdeiRwuX
LEWudwg+J3cf4DrhVlDjv91I24f78/fH4Bm8m/sugo98L/+UqNgCz9VXI4AP
ejRkZyIkWacEjkrBw8D7QttEEwo58247gYrimb07+MMVX36p+0S7pkpsdH1Y
3d3eOuHqqs3mG51eFlZng8Iax029NPQ7Umdt7l/Eru7g7pthJtPEmvPwMMB1
dcx+X2Aj6G9F+Jsa3hJNTPDsr3cKLOGcS9uu7iQjXVpMfhlF5v/16XCoDKfa
ZSGc6cEdEKLSfEIe7msD3n2gRLL4QfbXFSf7bJUi9dm6LRLMCBEks4QiakAm
t0AFeV96xsubg5uplBfkIROND3qU80ccSI5mhey8OC42zHKFj3B/Rf5v+qDF
X3T3
=2CEJ
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues related to scrubbing

2016-02-04 Thread Christian Balzer

Hello,

On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:

> Replies in-line:
> 
> On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
>  wrote:
> 
> >
> > Hello,
> >
> > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> >
> > > Hello,
> > >
> > > I've been trying to nail down a nasty performance issue related to
> > > scrubbing. I am mostly using radosgw with a handful of buckets
> > > containing millions of various sized objects. When ceph scrubs, both
> > > regular and deep, radosgw blocks on external requests, and my
> > > cluster has a bunch of requests that have blocked for > 32 seconds.
> > > Frequently OSDs are marked down.
> > >
> > From my own (painful) experiences let me state this:
> >
> > 1. When your cluster runs out of steam during deep-scrubs, drop what
> > you're doing and order more HW (OSDs).
> > Because this is a sign that it would also be in trouble when doing
> > recoveries.
> >
> 
> When I've initiated recoveries from working on the hardware the cluster
> hasn't had a problem keeping up. It seems that it only has a problem with
> scrubbing, meaning it feels like the IO pattern is drastically
> different. I would think that with scrubbing I'd see something closer to
> bursty sequential reads, rather than just thrashing the drives with a
> more random IO pattern, especially given our low cluster utilization.
>
It's probably more pronounced when phasing in/out entire OSDs, where it
also has to read the entire (primary) data off it.
 
> 
> >
> > 2. If you cluster is inconvenienced by even mere scrubs, you're really
> > in trouble.
> > Threaten the penny pincher with bodily violence and have that new HW
> > phased in yesterday.
> >
> 
> I am the penny pincher, biz owner, dev and ops guy for
> http://ridewithgps.com :) More hardware isn't an issue, it just feels
> pretty crazy to have this low of performance on a 12 OSD system. Granted,
> that feeling isn't backed by anything concrete! In general, I like to
> understand the problem before I solve it with hardware, though I am
> definitely not averse to it. I already ordered 6 more 4tb drives along
> with the new journal SSDs, anticipating the need.
> 
> As you can see from the output of ceph status, we are not space hungry by
> any means.
> 

Well, in Ceph, having just one OSD pegged to its maximum will (eventually)
impact everything that needs to read/write primary PGs on it. 

More below.

> 
> >
> > > According to atop, the OSDs being deep scrubbed are reading at only
> > > 5mb/s to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20
> > > minutes.
> > >
> > > Here's a screenshot of atop from a node:
> > > https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
> > >
> > This looks familiar.
> > Basically at this point in time the competing read request for all the
> > objects clash with write requests and completely saturate your HD
> > (about 120 IOPS and 85% busy according to your atop screenshot).
> >
> 
> In your experience would the scrub operation benefit from a bigger
> readahead? Meaning is it more sequential than random reads? I already
> bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.
> 
I played with that a long time ago (in benchmark scenarios) and didn't see
any noticeable improvement. 
Deep scrub might benefit (though fragmentation could hurt it); regular scrub
not so much.

> About half of our reads are on objects with an average size of 40kb (map
> thumbnails), and the other half are on photo thumbs with a size between
> 10kb and 150kb.
> 

Noted, see below.

> After doing a little more researching, I came across this:
> 
> http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage
> 
> Sounds like I am probably running into issues with lots of random read
> IO, combined with known issues around small files. To give an idea, I
> have about 15 million small map thumbnails stored in my two largest
> buckets, and I am pushing out about 30 requests per second right now
> from those two buckets.
> 
This is certainly a factor, but that knowledge of a future improvement
won't help you with your current problem of course. ^_-

> 
> 
> > There are ceph configuration options that can mitigate this to some
> > extend and which I don't see in your config, like
> > "osd_scrub_load_threshold" and "osd_scrub_sleep" along with the
> > various IO priority settings.
> > However the points above still stand.
> >
> 
> Yes, I have a running series of notes of config options to try out, just
> wanted to touch base with other community members before shooting in the
> dark.
> 
osd_scrub_sleep is probably the most effective immediately available
option for you to prevent slow, stalled IO. 
At the obvious cost of scrubs taking even longer.
There is of course also the option to disable scrubs entirely until your HW
has been upgraded.
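
Concretely, the kind of thing I mean (values are only a starting point, and injectargs
changes do not persist across restarts):

ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.5'
# or simply park scrubbing until the new hardware is in:
ceph osd set noscrub
ceph osd set nodeep-scrub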

> 
> >
> > XFS defragmentation might help, significantly if your FS is badly
> > fragmented. But again, this is only a temporary band-aid.
> >
> > > First question: is this a r

Re: [ceph-users] Set cache tier pool forward state automatically!

2016-02-04 Thread Christian Balzer
On Wed, 3 Feb 2016 22:42:32 -0700 Robert LeBlanc wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> 
> On Wed, Feb 3, 2016 at 9:00 PM, Christian Balzer  wrote:
> > On Wed, 3 Feb 2016 16:57:09 -0700 Robert LeBlanc wrote:
> 
> > That's an interesting strategy, I suppose you haven't run into the
> > issue I wrote about 2 days ago when switching to forward while running
> > rdb bench?
> 
> We haven't, but we are running 0.94.5. If you are running Firefly,
> that could be why.
> 
I just finished downgrading my test cluster from testing to Jessie and
then upgrading Ceph from Firefly to Hammer (that was fun few hours).

And I can confirm that I don't see that issue with Hammer, wonder if it's
worth prodding the devs about.
I sorta dread the time the PG upgrade process will take when going to
Hammer on the overloaded production server.
But then again, a fix for Firefly is both unlikely and going to take too
long for my case anyway.

> > In my case I venture that the number of really hot objects is small
> > enough to not overwhelm things and that 5K IOPS would be all that
> > cluster ever needs to provide.
> 
> We have 48x Micron M600 1TB drives. They do not perform as well as the
> Intel S3610 800GB which seems to have the right balance of performance
> and durability for our needs. Since we had to under provision the
> M600s, we should be able to get by with the 800GB just fine. Once we
> get the drives swapped out, we may do better than the 10K IOPs as well
> as the recency fix going into the next version of Hammer it will help
> with writeback. With our new cluster, we did fine in writeback mode
> until we hit that 10K IOP limit, then we started getting slow I/O
> messages, turning it to forward mode and things sped up a lot.
> 
Yeah, I'm using 800GB DC S3610s as well.


Christian
> 
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.3.4
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWsuTFCRDmVDuy+mK58QAAxO0P/R2mkmoOE/YSD9Ea8uUz
> XBlOI2eibT5DGK6jR/hVL0V0dInNtVM+4yGWEmvJm5nxnwbx+EQd+lCTFQ5y
> WouwGQLCMOCiy0rgduTeTwyGHjeIbloGoYYhZQPEFHOMt1lcKcwiEbrEKUYN
> csUmEApK2aiPna5dMsvQs39/oATuid9Aec8VwcyCozzWUe/UziXVFhdWw3Q5
> 2mz8AuOhrmFqd7iyFN9Dici/DXLhBxWgg4PWn81Ggzq/5LHGyyV6A0jiLCBH
> /B9rUCOmdfBvdK/GxCG7iUqIjVvIR2mtYFkCu7VL/exsnxuGRB2RHYcXgfVH
> rMbZ+gbK/T4XZvUTwDpsfzkEwOTlCuhkcMcHyZLl/MdmcNVXP2+cB9TaCbPI
> Hn2H0CuXqQhZ73znQSVS66/QA7s4W5LzMiAUZnOdIX05eVLnZEgstFr8fSEn
> O95Y4jLYyQB+CIF9IfA6fgGsvnrs0rTGvYEThk6HL1sa6uVwR5PESVJpapS5
> smUenHyp7OPTVdVpGzJh6VOOB08lcA7JFkicCSG1iXTPucuGkuVNMQ2i0LNb
> DA/WAbwUqSK1XHIIu2NCaDZsIbSPwWGXj2uwfNFgSzss1UqAVEF0cBfY6c6n
> 3bdPwY2SgOc7nB+LGDQM6dsaFqDS1E490cFwc85uDTkVOBL0JcAJHAvZV2lD
> w4Tj
> =H+mV
> -END PGP SIGNATURE-
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] network connectivity test tool?

2016-02-04 Thread Nigel Williams
I thought I had book-marked a neat shell script that used the
Ceph.conf definitions to do an all-to-all, all-to-one check of network
connectivity for a Ceph cluster (useful for discovering problems with
jumbo frames), but I've lost the bookmark and after trawling github
and trying various keywords cannot find it.

I thought the tool was in Ceph CBT or was a CERN-developed script, but
neither yielded a hit.

Anyone know where it is? thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why is there heavy read traffic during object delete?

2016-02-04 Thread Gregory Farnum
On Thu, Feb 4, 2016 at 5:07 PM, Stephen Lord  wrote:
>
>> On Feb 4, 2016, at 6:51 PM, Gregory Farnum  wrote:
>>
>> I presume we're doing reads in order to gather some object metadata
>> from the cephfs-data pool; and the (small) newly-created objects in
>> cache-data are definitely whiteout objects indicating the object no
>> longer exists logically.
>>
>> What kinds of reads are you actually seeing? Does it appear to be
>> transferring data, or merely doing a bunch of seeks? I thought we were
>> trying to avoid doing reads-to-delete, but perhaps the way we're
>> handling snapshots or something is invoking behavior that isn't
>> amicable to a full-FS delete.
>>
>> I presume you're trying to characterize the system's behavior, but of
>> course if you just want to empty it out entirely you're better off
>> deleting the pools and the CephFS instance entirely and then starting
>> it over again from scratch.
>> -Greg
>
> I believe it is reading all the data, just from the volume of traffic and
> the cpu load on the OSDs maybe suggests it is doing more than
> just that.
>
> iostat is showing a lot of data moving, I am seeing about the same volume
> of read and write activity here. Because the OSDs underneath both pools
> are the same ones, I know that’s not exactly optimal, it is hard to tell what
> which pool is responsible for which I/O. Large reads and small writes suggest
> it is reading up all the data from the objects,  the write traffic is I 
> presume all
> journal activity relating to deleting objects and creating the empty ones.
>
> The 9:1 ratio between things being deleted and created seems odd though.
>
> A previous version of this exercise with just a regular replicated data pool
> did not read anything, just a lot of write activity and eventually the content
> disappeared. So definitely related to the pool configuration here and probably
> not to the filesystem layer.

Sam, does this make any sense to you in terms of how RADOS handles deletes?
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why is there heavy read traffic during object delete?

2016-02-04 Thread Stephen Lord

> On Feb 4, 2016, at 6:51 PM, Gregory Farnum  wrote:
> 
> I presume we're doing reads in order to gather some object metadata
> from the cephfs-data pool; and the (small) newly-created objects in
> cache-data are definitely whiteout objects indicating the object no
> longer exists logically.
> 
> What kinds of reads are you actually seeing? Does it appear to be
> transferring data, or merely doing a bunch of seeks? I thought we were
> trying to avoid doing reads-to-delete, but perhaps the way we're
> handling snapshots or something is invoking behavior that isn't
> amicable to a full-FS delete.
> 
> I presume you're trying to characterize the system's behavior, but of
> course if you just want to empty it out entirely you're better off
> deleting the pools and the CephFS instance entirely and then starting
> it over again from scratch.
> -Greg

I believe it is reading all the data, just from the volume of traffic and
the cpu load on the OSDs maybe suggests it is doing more than
just that.

iostat is showing a lot of data moving; I am seeing about the same volume
of read and write activity here. Because the OSDs underneath both pools
are the same ones (I know that’s not exactly optimal), it is hard to tell
which pool is responsible for which I/O. Large reads and small writes suggest
it is reading up all the data from the objects; the write traffic is, I presume,
all journal activity relating to deleting objects and creating the empty ones.

The 9:1 ratio between things being deleted and created seems odd though.

A previous version of this exercise with just a regular replicated data pool
did not read anything, just a lot of write activity and eventually the content
disappeared. So definitely related to the pool configuration here and probably
not to the filesystem layer.

I will eventually just put this out of its misery and wipe it.

Steve



--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why is there heavy read traffic during object delete?

2016-02-04 Thread Gregory Farnum
On Thu, Feb 4, 2016 at 4:37 PM, Stephen Lord  wrote:
> I setup a cephfs file system with a cache tier over an erasure coded tier as 
> an experiment:
>
>   ceph osd erasure-code-profile set raid6 k=4 m=2
>   ceph osd pool create cephfs-metadata 512 512
>   ceph osd pool set cephfs-metadata size 3
>   ceph osd pool create cache-data 2048 2048
>   ceph osd pool create cephfs-data 256 256 erasure raid6 default_erasure
>   ceph osd tier add cephfs-data cache-data
>   ceph osd tier cache-mode cache-data writeback
>   ceph osd tier set-overlay cephfs-data cache-data
>   ceph osd pool set cache-data hit_set_type bloom
>   ceph osd pool set cache-data target_max_bytes 1099511627776
>
> The file system was created from the cephfs-metadata and cephfs-data pools
>
> After adding a lot of data to this and waiting for the pools to idle down and 
> stabilize I removing the file system content with rm. I am seeing very 
> strange behavior, the file system remove was quick, and then it started 
> removing the data from the pools. However it appears to be reading the data 
> from the erasure coded pool and creating empty content in the cache pool.
>
> At its peak capacity the system looked like this:
>
> NAME         ID  CATEGORY  USED   %USED  MAX AVAIL  OBJECTS  DIRTY  READ   WRITE
> cache-data   21  -         791G   5.25   6755G      256302   140k   22969  2138k
> cephfs-data  22  -         4156G  27.54  4503G      1064086  1039k  51271  1046k
>
>
> 2 hours later it looked like this:
>
> NAME         ID  CATEGORY  USED   %USED  MAX AVAIL  OBJECTS  DIRTY  READ   WRITE
> cache-data   21  -         326G   2.17   7576G      702142   559k   22969  2689k
> cephfs-data  22  -         3964G  26.27  5051G      1014842  991k   476k   1143k
>
> The object count in the erasure coded pool has gone down a little, the count 
> in the cache pool has gone up a lot, there has been a lot of read activity in 
> the erasure coded pool and write activity into both pools. The used count in 
> the cache pool is also going down. It looks like the cache pool is gaining 9 
> objects for each one removed from the erasure code pool. Looking at the 
> actual files being created by the OSD for this, they are empty.
>
> What is going on here? It looks like this will take a day or so to complete 
> at this rate of progress.

I presume we're doing reads in order to gather some object metadata
from the cephfs-data pool; and the (small) newly-created objects in
cache-data are definitely whiteout objects indicating the object no
longer exists logically.

What kinds of reads are you actually seeing? Does it appear to be
transferring data, or merely doing a bunch of seeks? I thought we were
trying to avoid doing reads-to-delete, but perhaps the way we're
handling snapshots or something is invoking behavior that isn't
amicable to a full-FS delete.

I presume you're trying to characterize the system's behavior, but of
course if you just want to empty it out entirely you're better off
deleting the pools and the CephFS instance entirely and then starting
it over again from scratch.
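
If you do go that route, the teardown is roughly the following (untested as written, and
the exact safety flags may differ on master; <fs_name> is a placeholder):

  ceph mds fail 0
  ceph fs rm <fs_name> --yes-i-really-mean-it
  ceph osd tier cache-mode cache-data forward
  ceph osd tier remove-overlay cephfs-data
  ceph osd tier remove cephfs-data cache-data
  ceph osd pool delete cache-data cache-data --yes-i-really-really-mean-it
  ceph osd pool delete cephfs-data cephfs-data --yes-i-really-really-mean-it
  ceph osd pool delete cephfs-metadata cephfs-metadata --yes-i-really-really-mean-it
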
-Greg


>
> The ceph version here is the master branch from a couple of days ago.
>
> Thanks
>
>   Steve Lord
>
>
>
> --
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] why is there heavy read traffic during object delete?

2016-02-04 Thread Stephen Lord
I setup a cephfs file system with a cache tier over an erasure coded tier as an 
experiment:

  ceph osd erasure-code-profile set raid6 k=4 m=2 
  ceph osd pool create cephfs-metadata 512 512 
  ceph osd pool set cephfs-metadata size 3
  ceph osd pool create cache-data 2048 2048
  ceph osd pool create cephfs-data 256 256 erasure raid6 default_erasure
  ceph osd tier add cephfs-data cache-data
  ceph osd tier cache-mode cache-data writeback
  ceph osd tier set-overlay cephfs-data cache-data
  ceph osd pool set cache-data hit_set_type bloom
  ceph osd pool set cache-data target_max_bytes 1099511627776 

The file system was created from the cephfs-metadata and cephfs-data pools

After adding a lot of data to this and waiting for the pools to idle down and 
stabilize, I removed the file system content with rm. I am seeing very strange 
behavior: the file system remove was quick, and then it started removing the 
data from the pools. However, it appears to be reading the data from the 
erasure-coded pool and creating empty content in the cache pool.

At its peak capacity the system looked like this:

NAME         ID  CATEGORY  USED   %USED  MAX AVAIL  OBJECTS  DIRTY  READ   WRITE
cache-data   21  -         791G   5.25   6755G      256302   140k   22969  2138k
cephfs-data  22  -         4156G  27.54  4503G      1064086  1039k  51271  1046k


2 hours later it looked like this:

NAME         ID  CATEGORY  USED   %USED  MAX AVAIL  OBJECTS  DIRTY  READ   WRITE
cache-data   21  -         326G   2.17   7576G      702142   559k   22969  2689k
cephfs-data  22  -         3964G  26.27  5051G      1014842  991k   476k   1143k

The object count in the erasure coded pool has gone down a little, the count in 
the cache pool has gone up a lot, there has been a lot of read activity in the 
erasure coded pool and write activity into both pools. The used count in the 
cache pool is also going down. It looks like the cache pool is gaining 9 
objects for each one removed from the erasure code pool. Looking at the actual 
files being created by the OSD for this, they are empty.

What is going on here? It looks like this will take a day or so to complete at 
this rate of progress.

The ceph version here is the master branch from a couple of days ago.

Thanks

  Steve Lord



--
The information contained in this transmission may be confidential. Any 
disclosure, copying, or further distribution of confidential information is not 
permitted unless such privilege is explicitly granted in writing by Quantum. 
Quantum reserves the right to have electronic communications, including email 
and attachments, sent across its networks filtered through anti virus and spam 
software programs and retain such messages in order to comply with applicable 
data security and retention requirements. Quantum is not responsible for the 
proper and complete transmission of the substance of this communication or for 
any delay in its receipt.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg dump question

2016-02-04 Thread Gregory Farnum
On Thu, Feb 4, 2016 at 10:23 AM, WRIGHT, JON R (JON R)
 wrote:
> New ceph user, so a basic question
>
> I have a newly setup Ceph cluster.   Seems to be working ok.  But . . .
>
> I'm looking at the output of ceph pg dump, and I see that in the osdstat
> list at the bottom of the output, there are empty brackets [] in the 'hb
> out' column for all of the OSDs.  It seems that there should OSD ids listed
> in those brackets because each OSD should be reporting heartbeats out to at
> least some other OSDs.  Or so I think.
>
> On the other hand, the hb in column has non-null entries -- meaning (I
> think) that the OSD is receiving heartbeats from at least some other OSDs.
>
> It's curious to me, because if OSD 1 is reporting heartbeats in from OSD2,
> why wouldn't OSD 2 list OSD 1 in its hb out list?
>
> Is there an answer for this, and is it something to be concerned about?

The heartbeat system has been rewritten a couple of times. I believe
at one point we eliminated the in/out concept and so all the heartbeat
peers are being dumped in a single list now, but we maintained both
lists for compatibility of parsers.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and hadoop (fstab insted of CephFS)

2016-02-04 Thread Zoltan Arnold Nagy
Might be totally wrong here, but it’s not layering them; it's replacing hdfs:// 
URLs with ceph:// URLs, so all the MapReduce/Spark/HBase/whatever on top can 
use CephFS directly, which is not a bad thing to do (if it works) :-)
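
For comparison, the two approaches look roughly like this (a sketch only; monitor address
and paths are placeholders). The plugin route points Hadoop at CephFS through core-site.xml
properties along the lines of fs.ceph.impl=org.apache.hadoop.fs.ceph.CephFileSystem and
fs.default.name=ceph://<mon-host>:6789/, while the fstab route the OP asks about would be a
plain kernel mount, e.g.:

  <mon-host>:6789:/  /mnt/cephfs  ceph  name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev  0 2

with the HDFS namenode/datanode directories then pointed at paths under /mnt/cephfs.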

> On 02 Feb 2016, at 16:50, John Spray  wrote:
> 
> On Tue, Feb 2, 2016 at 3:42 PM, Jose M  > wrote:
>> Hi,
>> 
>> 
>> One simple question, in the ceph docs says that to use Ceph as an HDFS
>> replacement, I can use the CephFs Hadoop plugin
>> (http://docs.ceph.com/docs/master/cephfs/hadoop/).
>> 
>> 
>> What I would like to know if instead of using the plugin, I can mount ceph
>> in fstab and then point hdfs dirs (namenode, datanode, etc) to this mounted
>> "ceph" dirs, instead of native local dirs.
>> 
>> I understand that maybe will involve more configuration steps (configuring
>> fstab in each node), but will this work? Is there any problem with this type
>> of configuration?
> 
> Without being a big HDFS expert, it seems like you would be
> essentially putting one distributed filesystem on top of another
> distributed filesystem.  I don't know if you're going to find anything
> that breaks as such, but it's probably not a good idea.
> 
> John
> 
>> 
>> Thanks in advance,
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Vacaciones Hernán Pinto

2016-02-04 Thread hpinto
I am on vacation until Friday, February 12. 

For requests regarding ongoing projects, please contact Ariel Muñoz at 
amu...@iia.cl or by phone at +56228401000.

For incidents, please contact soporte.inter...@iia.cl or call +56228401100.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Feb Ceph Developer Monthly

2016-02-04 Thread Patrick McGarry
Hey cephers,

For those of you that weren’t able to make the inaugural Ceph
Developer Monthly (CDM) call yesterday (or for those who wish to
review), the recording is now available on the Ceph YouTube channel:

https://youtu.be/0gIqgxrmrJw

If you are working on something related to Ceph, or would like to see
how the sausage is made, feel free to join us next month. The March
meeting is tentatively scheduled for 02 Mar @ 21:00 EST:

http://tracker.ceph.com/projects/ceph/wiki/Planning

If you have any questions, comments, or anything for the good of the
cause, please don’t hesitate to contact me. Thanks.

-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg dump question

2016-02-04 Thread WRIGHT, JON R (JON R)

New ceph user, so a basic question

I have a newly setup Ceph cluster.   Seems to be working ok.  But . . .

I'm looking at the output of ceph pg dump, and I see that in the osdstat 
list at the bottom of the output, there are empty brackets [] in the 'hb 
out' column for all of the OSDs.  It seems that there should be OSD ids 
listed in those brackets, because each OSD should be reporting heartbeats 
out to at least some other OSDs.  Or so I think.


On the other hand, the hb in column has non-null entries -- meaning (I 
think) that the OSD is receiving heartbeats from at least some other OSDs.


It's curious to me, because if OSD 1 is reporting heartbeats in from 
OSD2, why wouldn't OSD 2 list OSD 1 in its hb out list?


Is there an answer for this, and is it something to be concerned about?

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimal OSD count for SSDs / NVMe disks

2016-02-04 Thread Zoltan Arnold Nagy
One option you left out: you could put the journals on NVMe plus use the leftover space
for a writeback bcache device which caches those 5 OSDs. This is exactly what I’m testing
at the moment - 4x NVMe + 20 disks per box.

Or just use the NVMe itself as a bcache cache device (don’t partition it) and let the
journal be a file on the writeback-cached OSD :-)

Might be interesting to compare this to the cache pool version.

I’d love to hear other’s opinions on this!

> On 03 Feb 2016, at 13:01, Sascha Vogt  wrote:
> 
> Hi all,
> 
> we recently tried adding a cache tier to our ceph cluster. We had 5
> spinning disks per host with a single journal NVMe disk, hosting the 5
> journals (1 OSD per spinning disk). We have 4 hosts up to now, so
> overall 4 NVMes hosting 20 journals for 20 spinning disks.
> 
> As we had some space left on the NVMes we made two additional
> partitions on each NVMe and created a 4 OSD cache tier.
> 
> To our surprise the 4 OSD cache pool was able to deliver the same
> performance as the previous 20 OSD pool while reducing the OPs on the
> spinning disks to zero as long as the cache pool was sufficient to hold
> all / most data (ceph is used for very short-living KVM virtual machines
> which do pretty heavy disk IO).
> 
> As we don't need that much more storage right now we decided to extend
> our cluster by adding 8 additional NVMe disks solely as a cache pool and
> freeing the journal NVMes again. Now the question is: how to organize
> the OSDs on the NVMe disks (2 per host)?
> 
> As the NVMes peak around 5-7 concurrent sequential writes (tested with
> fio) I thought about using 5 OSDs per NVMe. That would mean 10
> partitions (5 journals, 5 data). On the other hand the NVMes are only
> 400GB large, so that would result in OSD disk sizes of <80 GB
> (depending on the journal size).
> 
> Would it make sense to skip the separate journal partition and leave the
> journal on the data disk itself, limiting it to a rather small
> amount (let's say 1 GB or even less?), as SSDs typically don't like
> sequential writes anyway?
> 
> Or, if I leave journal and data on separate partitions, should I reduce
> the number of OSDs per disk to 3, as Ceph will most likely write to
> journal and data in parallel and I therefore already get 6 parallel
> "threads" of IO?
> 
> Any feedback is highly appreciated :)
> 
> Greetings
> -Sascha-
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
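
For anyone curious, the whole-device variant I mentioned above (NVMe as a bcache cache in
front of the spinners, journal as a file on the cached OSD) would look roughly like this.
A sketch only, assuming bcache-tools; device names and the cache set UUID are placeholders,
untested as written:

  # make the NVMe a cache device and a spinner a backing device
  make-bcache -C /dev/nvme0n1
  make-bcache -B /dev/sdb
  # attach the backing device to the cache set and switch to writeback
  echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  # then create the OSD on /dev/bcache0 and leave the journal as a file on it
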
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading with mon & osd on same host

2016-02-04 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Just make sure that your monitors and OSDs are on the very latest of
Hammer or else your Infernalis OSDs won't activate.
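
A rough per-host sequence for the OSD part (service commands depend on your init system,
and the chown is only needed because Infernalis switches the daemons to the 'ceph' user;
<id> is a placeholder):

ceph osd set noout
# for each OSD on the host:
stop ceph-osd id=<id>
chown -R ceph:ceph /var/lib/ceph/osd/ceph-<id>
start ceph-osd id=<id>
# once every host is done:
ceph osd unset noout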
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Feb 4, 2016 at 12:23 AM, Mika c  wrote:
> Hi,
> >* Do the packages (Debian) restart the services upon upgrade?
> No need, restart by yourself.
>
>>Do I need to actually stop all OSDs, or can I upgrade them one by one?
> No need to stop. Just upgrade osd server one by one and restart each osd
> daemons.
>
>
>
> Best wishes,
> Mika
>
>
> 2016-02-03 18:55 GMT+08:00 Udo Waechter :
>>
>> Hi,
>>
>> I would like to upgrade my ceph cluster from hammer to infernalis.
>>
>> I'm reading the upgrade notes, that I need to upgrade & restart the
>> monitors first, then the OSDs.
>>
>> Now, my cluster has OSDs and Mons on the same hosts (I know that should
>> not be the case, but it is :( ).
>>
>> I'm just wondering:
>> * Do the packages (Debian) restart the services upon upgrade?
>>
>>
>> In theory it should work this way:
>>
>> * install / upgrade the new packages
>> * restart all mons
>> * stop OSD one by one and change the user accordingly.
>>
>> Another question then:
>>
>> Do I need to actually stop all OSDs, or can I upgrade them one by one?
>> I don't want to take the whole cluster down :(
>>
>> Thanks very much,
>> udo.
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.4
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWs4YFCRDmVDuy+mK58QAAdjUP/RNYDkRYaPuspNei14sh
XrM23GLCsmuv7jderKwOkG2wsIQOxR86E5F/dUEnB0UB+CKAvYBi3w2cHNc1
PrZoWkPkiEy2+bsQW65CoPK4UoghFoNGWABSPIDgcNjrblbTJ+Ph0FEfXSNn
PnlZ40/NySiCypPbKTOFag8o30eOqIO1UDjWqTdeQWhVKmQpAGWEAMc/A1Dk
YHLqq1MQOiZ1Zh16Bx664sspR68GYnWw57MF5bVterEahlhm8/n17rJVDFT/
440+Idph3GEpIWqiXLLYM8nCIiwsXO30OxdwTVVpoDrszh782E2jAMkW9cCs
IXBkZRgq4M6Gz4P76BWiNJN0CeTsA0NUwQVZQl9cndeLgyqhCzFS8825ixfl
fFFiz3RFqluVzP55V+D3IEFZHlbiYMZtx1HbrjWR1UG1Q40PnB3XxwxiNBDT
dKsjpGMYeHs/KPUdMaWraQqBxjWC1bvc00eqVhQZm/Xz+jniitr+DGfh9afi
sTYYiHJcURgpvvbi77oOglzYfMes+b5oOxJT5KII2eEDothG6GF63Bn7c75W
7BjjlR4ugmD6kO4PsyF2NisfdL7IpEQe/aiieGPU10QRvVfRdu5LEGd6/An2
YxvAhzQxx+gJzknBDlbh95wcdVy/MHKDO3XoK1FXOpRaejCcPLRhu3rW/vgy
ZRJc
=rHFo
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues related to scrubbing

2016-02-04 Thread Cullen King
Replies in-line:

On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer 
wrote:

>
> Hello,
>
> On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
>
> > Hello,
> >
> > I've been trying to nail down a nasty performance issue related to
> > scrubbing. I am mostly using radosgw with a handful of buckets containing
> > millions of various sized objects. When ceph scrubs, both regular and
> > deep, radosgw blocks on external requests, and my cluster has a bunch of
> > requests that have blocked for > 32 seconds. Frequently OSDs are marked
> > down.
> >
> From my own (painful) experiences let me state this:
>
> 1. When your cluster runs out of steam during deep-scrubs, drop what
> you're doing and order more HW (OSDs).
> Because this is a sign that it would also be in trouble when doing
> recoveries.
>

When I've initiated recoveries from working on the hardware the cluster
hasn't had a problem keeping up. It seems that it only has a problem with
scrubbing, meaning it feels like the IO pattern is drastically different. I
would think that with scrubbing I'd see something closer to bursty
sequential reads, rather than just thrashing the drives with a more random
IO pattern, especially given our low cluster utilization.


>
> 2. If you cluster is inconvenienced by even mere scrubs, you're really in
> trouble.
> Threaten the penny pincher with bodily violence and have that new HW
> phased in yesterday.
>

I am the penny pincher, biz owner, dev and ops guy for
http://ridewithgps.com :) More hardware isn't an issue, it just feels
pretty crazy to have this low of performance on a 12 OSD system. Granted,
that feeling isn't backed by anything concrete! In general, I like to
understand the problem before I solve it with hardware, though I am
definitely not averse to it. I already ordered 6 more 4tb drives along with
the new journal SSDs, anticipating the need.

As you can see from the output of ceph status, we are not space hungry by
any means.


>
> > According to atop, the OSDs being deep scrubbed are reading at only 5mb/s
> > to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20 minutes.
> >
> > Here's a screenshot of atop from a node:
> > https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
> >
> This looks familiar.
> Basically at this point in time the competing read request for all the
> objects clash with write requests and completely saturate your HD (about
> 120 IOPS and 85% busy according to your atop screenshot).
>

In your experience would the scrub operation benefit from a bigger
readahead? Meaning is it more sequential than random reads? I already
bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.

About half of our reads are on objects with an average size of 40kb (map
thumbnails), and the other half are on photo thumbs with a size between
10kb and 150kb.

After doing a little more researching, I came across this:

http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage

Sounds like I am probably running into issues with lots of random read IO,
combined with known issues around small files. To give an idea, I have
about 15 million small map thumbnails stored in my two largest buckets, and
I am pushing out about 30 requests per second right now from those two
buckets.



> There are ceph configuration options that can mitigate this to some
> extend and which I don't see in your config, like
> "osd_scrub_load_threshold" and "osd_scrub_sleep" along with the various IO
> priority settings.
> However the points above still stand.
>

Yes, I have a running series of notes of config options to try out, just
wanted to touch base with other community members before shooting in the
dark.


>
> XFS defragmentation might help, significantly if your FS is badly
> fragmented. But again, this is only a temporary band-aid.
>
> > First question: is this a reasonable speed for scrubbing, given a very
> > lightly used cluster? Here's some cluster details:
> >
> > deploy@drexler:~$ ceph --version
> > ceph version 0.94.1-5-g85a68f9 (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)
> >
> >
> > 2x Xeon E5-2630 per node, 64gb of ram per node.
> >
> More memory can help by keeping hot objects in the page cache (so the
> actual disks need not be read and can write at their full IOPS capacity).
> A lot of memory (and the correct sysctl settings) will also allow for a
> large SLAB space, keeping all those directory entries and other bits in
> memory without having to go to disk to get them.
>
> You seem to be just fine CPU wise.
>

I thought about bumping each node up to 128gb of ram as another cheap
insurance policy. I'll try that after the other changes. I'd like to know
why so I'll try and change one thing at a time, though I am also just eager
to have this thing stable.


>
> >
> > deploy@drexler:~$ ceph status
> > cluster 234c6825-0e2b-4256-a710-71d29f4f023e
> >  health HEALTH_WARN
> > 118 requests are blocked > 32 sec
> >  monmap e1: 3 mons at {drexler=
> > 10.0.0.

[ceph-users] hb in and hb out from pg dump

2016-02-04 Thread WRIGHT, JON R (JON R)

New ceph user, so a basic question  :)

I have a newly setup Ceph cluster.   Seems to be working ok.  But . . .

I'm looking at the output of ceph pg dump, and I see that in the osdstat 
list at the bottom of the output, there are empty brackets [] in the 'hb 
out' column for all of the OSDs.  It seems that there should be OSD ids 
listed in those brackets, because each OSD should be reporting heartbeats 
out to at least some other OSDs.  Or so I think.


On the other hand, the hb in column has non-null entries -- meaning (I 
think) that the OSD is receiving heartbeats from at least some other OSDs.


It's curious to me, because if OSD 1 is reporting heartbeats in from 
OSD2, why wouldn't OSD 2 list OSD 1 in its hb out list?


Is there an answer for this, and is it something to be concerned about?

Thanks



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 9.2.0 mds cluster went down and now constantly crashes with Floating point exception

2016-02-04 Thread Gregory Farnum
On Thu, Feb 4, 2016 at 1:42 AM, Kenneth Waegeman
 wrote:
> Hi,
>
> Hi, we are running ceph 9.2.0.
> Overnight, our ceph state went to 'mds mds03 is laggy' . When I checked the
> logs, I saw this mds crashed with a stacktrace. I checked the other mdss,
> and I saw the same there.
> When I try to start the mds again, I get again a stacktrace and it won't
> come up:
>
>  -12> 2016-02-04 10:23:46.837131 7ff9ea570700  1 --
> 10.141.16.2:6800/193767 <== osd.146 10.141.16.25:6800/7036 1 
> osd_op_reply(207 15ef982. [stat] v0'0 uv22184 ondisk = 0) v6
>  187+0+16 (113
> 2261152 0 506978568) 0x7ffa171ae940 con 0x7ffa189cc3c0
> -11> 2016-02-04 10:23:46.837317 7ff9ed6a1700  1 --
> 10.141.16.2:6800/193767 <== osd.136 10.141.16.24:6800/6764 6 
> osd_op_reply(209 148aaac. [delete] v0'0 uv23797 ondisk = -2 ((2)
> No such file o
> r directory)) v6  187+0+0 (64699207 0 0) 0x7ffa171acb00 con
> 0x7ffa014fd9c0
> -10> 2016-02-04 10:23:46.837406 7ff9ec994700  1 --
> 10.141.16.2:6800/193767 <== osd.36 10.141.16.14:6800/5395 5 
> osd_op_reply(175 15f631f. [stat] v0'0 uv22466 ondisk = 0) v6
>  187+0+16 (1037
> 61047 0 2527067705) 0x7ffa08363700 con 0x7ffa189ca580
>  -9> 2016-02-04 10:23:46.837463 7ff9eba85700  1 --
> 10.141.16.2:6800/193767 <== osd.47 10.141.16.15:6802/7128 2 
> osd_op_reply(211 148aac8. [delete] v0'0 uv22990 ondisk = -2 ((2)
> No such file or
>   directory)) v6  187+0+0 (1138385695 0 0) 0x7ffa01cd0dc0 con
> 0x7ffa189cadc0
>  -8> 2016-02-04 10:23:46.837468 7ff9eb27d700  1 --
> 10.141.16.2:6800/193767 <== osd.16 10.141.16.12:6800/5739 2 
> osd_op_reply(212 148aacd. [delete] v0'0 uv23991 ondisk = -2 ((2)
> No such file or
>   directory)) v6  187+0+0 (1675093742 0 0) 0x7ffa171ac840 con
> 0x7ffa189cb760
>  -7> 2016-02-04 10:23:46.837477 7ff9eab76700  1 --
> 10.141.16.2:6800/193767 <== osd.66 10.141.16.17:6800/6353 2 
> osd_op_reply(210 148aab9. [delete] v0'0 uv24583 ondisk = -2 ((2)
> No such file or
>   directory)) v6  187+0+0 (603192739 0 0) 0x7ffa19054680 con
> 0x7ffa189cbce0
>  -6> 2016-02-04 10:23:46.838140 7ff9f0bcf700  1 --
> 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 43 
> osd_op_reply(121 200.9d96 [write 1459360~980] v943'4092 uv4092 ondisk =
> 0) v6  179+0+0 (3939130488 0 0) 0x7ffa01590100 con 0x7ffa014fab00
>  -5> 2016-02-04 10:23:46.838342 7ff9f0bcf700  1 --
> 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 44 
> osd_op_reply(124 200.9d96 [write 1460340~956] v943'4093 uv4093 ondisk =
> 0) v6  179+0+0 (1434265886 0 0) 0x7ffa01590100 con 0x7ffa014fab00
>  -4> 2016-02-04 10:23:46.838531 7ff9f0bcf700  1 --
> 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 45 
> osd_op_reply(126 200.9d96 [write 1461296~954] v943'4094 uv4094 ondisk =
> 0) v6  179+0+0 (25292940 0 0) 0x7ffa01590100 con 0x7ffa014fab00
>  -3> 2016-02-04 10:23:46.838700 7ff9ecd98700  1 --
> 10.141.16.2:6800/193767 <== osd.57 10.141.16.16:6802/7067 3 
> osd_op_reply(199 15ef976. [stat] v0'0 uv22557 ondisk = 0) v6
>  187+0+16 (354652996 0 2244692791) 0x7ffa171ade40 con 0x7ffa189ca160
>  -2> 2016-02-04 10:23:46.839301 7ff9ed8a3700  1 --
> 10.141.16.2:6800/193767 <== osd.107 10.141.16.21:6802/7468 3 
> osd_op_reply(115 1625476. [stat] v0'0 uv22587 ondisk = 0) v6
>  187+0+16 (664308076 0 998461731) 0x7ffa08363c80 con 0x7ffa014fdb20
>  -1> 2016-02-04 10:23:46.839322 7ff9f0bcf700  1 --
> 10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 46 
> osd_op_reply(128 200.9d96 [write 1462250~954] v943'4095 uv4095 ondisk =
> 0) v6  179+0+0 (1379768629 0 0) 0x7ffa01590100 con 0x7ffa014fab00
>   0> 2016-02-04 10:23:46.839379 7ff9f30d8700 -1 *** Caught signal
> (Floating point exception) **
>   in thread 7ff9f30d8700
>
>   ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>   1: (()+0x4b6fa2) [0x7ff9fd091fa2]
>   2: (()+0xf100) [0x7ff9fbfd3100]
>   3: (StrayManager::_calculate_ops_required(CInode*, bool)+0xa2)
> [0x7ff9fcf0adc2]
>   4: (StrayManager::enqueue(CDentry*, bool)+0x169) [0x7ff9fcf10459]
>   5: (StrayManager::__eval_stray(CDentry*, bool)+0xa49) [0x7ff9fcf111c9]
>   6: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7ff9fcf113ce]
>   7: (MDCache::scan_stray_dir(dirfrag_t)+0x13d) [0x7ff9fce6741d]
>   8: (MDSInternalContextBase::complete(int)+0x1e3) [0x7ff9fcff4993]
>   9: (MDSRank::_advance_queues()+0x382) [0x7ff9fcdd4652]
>   10: (MDSRank::ProgressThread::entry()+0x4a) [0x7ff9fcdd4aca]
>   11: (()+0x7dc5) [0x7ff9fbfcbdc5]
>   12: (clone()+0x6d) [0x7ff9faeb621d]
>
> Does someone has an idea? We can't use our fs right now..

Hey, fun! Just looking for FPE opportunities in that function, it
looks like someone managed to set either the object size or stripe
count to 0 on some of your files. Is that possible?
As for repair chances...hmm. I think you'll ne
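
If you get an MDS up long enough to poke at the files, one way to check a suspect file's
layout from a client mount (assuming a reasonably recent kernel or FUSE client) is:

  getfattr -n ceph.file.layout /mnt/cephfs/path/to/file

which prints stripe_unit, stripe_count and object_size; any of those being 0 would fit
this crash.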

Re: [ceph-users] Default CRUSH Weight Set To 0 ?

2016-02-04 Thread Burkhard Linke

Hi,

On 02/04/2016 03:17 PM, Kyle Harris wrote:

Hello,

I have been working on a very basic cluster with 3 nodes and a single 
OSD per node.  I am using Hammer installed on CentOS 7 
(ceph-0.94.5-0.el7.x86_64) since it is the LTS version.  I kept 
running into an issue of not getting past the status of 
undersized+degraded+peered.  I finally discovered the problem was that 
in the default CRUSH map, the weight assigned is 0.  I changed the 
weight and everything came up as it should.  I did the same test using 
the Infernalis release and everything worked as expected as the weight 
has been changed to a default of 321.


- Is this a bug or by design and if the latter, why? Perhaps I'm 
missing something?

- Has anyone else ran into this?
- Am I correct in assuming a weight of 0 won't allow the OSDs to be 
used or is there some other purpose for this?
The default weight is the size of the OSD in terabytes. Did you use a 
very small OSD partition for test purposes, e.g. 20 GB? In that case the 
weight is rounded and results in an effective weight of 0.0. As a result 
the OSD will not be used for data storage.
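
If that is what happened, a quick fix on a test cluster is to bump the weights by hand, 
e.g.:

  ceph osd crush reweight osd.0 0.02

(roughly the disk size in TB, repeated for each OSD; osd.0 is just an example id).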


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds0: Client X failing to respond to capability release

2016-02-04 Thread Yan, Zheng

> On Feb 4, 2016, at 17:00, Michael Metz-Martini | SpeedPartner GmbH 
>  wrote:
> 
> Hi,
> 
> Am 04.02.2016 um 09:43 schrieb Yan, Zheng:
>> On Thu, Feb 4, 2016 at 4:36 PM, Michael Metz-Martini | SpeedPartner
>> GmbH  wrote:
>>> Am 03.02.2016 um 15:55 schrieb Yan, Zheng:
> On Feb 3, 2016, at 21:50, Michael Metz-Martini | SpeedPartner GmbH 
>  wrote:
> Am 03.02.2016 um 12:11 schrieb Yan, Zheng:
>>> On Feb 3, 2016, at 17:39, Michael Metz-Martini | SpeedPartner GmbH 
>>>  wrote:
>>> Am 03.02.2016 um 10:26 schrieb Gregory Farnum:
 On Tue, Feb 2, 2016 at 10:09 PM, Michael Metz-Martini | SpeedPartner
> 2016-02-03 14:42:25.581840 7fadfd280700  0 log_channel(default) log
> [WRN] : 7 slow requests, 6 included below; oldest blocked for >
> 62.125785 secs
> 2016-02-03 14:42:25.581849 7fadfd280700  0 log_channel(default) log
> [WRN] : slow request 62.125785 seconds old, received at 2016-02-03
> 14:41:23.455812: client_request(client.10199855:1313157 getattr
> pAsLsXsFs #100815bd349 2016-02-03 14:41:23.452386) currently failed to
> rdlock, waiting
 
 This seems like dirty page writeback is too slow.  Is there any hung OSD 
 request in /sys/kernel/debug/ceph/xxx/osdc?
>>> Where should I check? Client or mds? Do I have to enable something to
>>> get this details? Directory /sys/kernel/debug/ceph/ seems to be missing.
>> On client with kernel ceph mount. If there is no debugfs, mount
>> debugfs first (mount -t debugfs /sys/kernel/debug /sys/kernel/debug)
> Got it. http://www.michael-metz.de/osdc.txt.gz (about 500kb uncompressed)

That’s quite a lot of requests. Could you pick some requests in osdc and check 
how long these requests last.

> 
> By looking around I found caps .. $ cat caps
> total   305975
> avail   2
> used305973
> reserved0
> min 1024
> 
> Somehow related? avail=2 is low ;-)

This is not a problem.



> 
> -- 
> Kind regards
> Michael Metz-Martini

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Default CRUSH Weight Set To 0 ?

2016-02-04 Thread Kyle Harris
Hello,

I have been working on a very basic cluster with 3 nodes and a single OSD
per node.  I am using Hammer installed on CentOS 7
(ceph-0.94.5-0.el7.x86_64) since it is the LTS version.  I kept running
into an issue of not getting past the status of
undersized+degraded+peered.  I finally discovered the problem was that in
the default CRUSH map, the weight assigned is 0.  I changed the weight and
everything came up as it should.  I did the same test using the Infernalis
release and everything worked as expected as the weight has been changed to
a default of 321.

- Is this a bug or by design and if the latter, why?  Perhaps I'm missing
something?
- Has anyone else run into this?
- Am I correct in assuming a weight of 0 won't allow the OSDs to be used or
is there some other purpose for this?

Hopefully this will help others that may run into this same situation.

Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimal OSD count for SSDs / NVMe disks

2016-02-04 Thread Wade Holler
First on your comment of:

"we found that during times where the cache pool flushed to
the storage pool client IO took a severe hit"

We found the same thing.
http://blog.wadeit.io/ceph-cache-tier-performance-random-writes/
-- I don't claim this is a great write-up, and it's not what a lot of folks are
interested in, but it is what I was after.

Great on your fio test.  However, take a look at the response time:
naturally it will increase after 4-5 concurrent writes, which is of course
what you were saying and is correct.  Still, I think we can generally
accept a slightly higher response time, and therefore iodepth>1 is a more
real-world test.  Just my thoughts. You did the right thing, and tested
well.

Some might not like it, but I like Sebastien's journal size calculation,
and it has served me well:
http://slides.com/sebastienhan/ceph-performance-and-benchmarking#/24
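
For reference, the rule of thumb behind it is the one from the Ceph docs:
journal size >= 2 * (expected throughput * filestore max sync interval). A
sketch for ceph.conf, assuming roughly 1000 MB/s of journal device throughput
and the default 5 s sync interval:

[osd]
# 2 * 1000 MB/s * 5 s = 10000 MB
osd journal size = 10000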

Cheers
Wade





On Thu, Feb 4, 2016 at 7:24 AM Sascha Vogt  wrote:

> Hi,
>
> Am 04.02.2016 um 12:59 schrieb Wade Holler:
> > You referenced parallel writes for journal and data. Which is default
> > for btrfs but not XFS. Now you are mentioning multiple parallel writes
> > to the drive , which of course yes will occur.
> Ah, that is good to know. So if I want to create more "parallelism" I
> should use btrfs then. Thanks a lot, that's a very critical bit of
> information :)
>
> > Also Our Dell 400 Gb NVMe drives do not top out around 5-7 sequential
> > writes as you mentioned. That would be 5-7 random writes from a drives
> > perspective and the NVMe drives can do many times that.
> Hm, I used the following fio bench from [1]:
> fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=journal-test
>
> Our disks showed the following bandwidths: (# is the numjobs
> parameter):
>
> #1: write: io=1992.2MB, bw=33997KB/s, iops=8499
> #2: write: io=5621.6MB, bw=95940KB/s, iops=23984
> #3: write: io=8062.8MB, bw=137602KB/s, iops=34400
> #4: write: io=9114.1MB, bw=155545KB/s, iops=38886
> #5: write: io=8860.7MB, bw=151169KB/s, iops=37792
>
> Also for more jobs (tried up to 8) bandwidth stayed at around 150MB/s
> and around 37k iops. So I figured that around 5 should be the sweet spot
> in terms of journals on a single disk.
>
> > I would park it at 5-6 partitions per NVMe , journal on the same disk.
> > Frequently I want more concurrent operations , rather than all out
> > throughput.
> For journal on the same partition, should I limit the size of the
> journal size? If yes, what should be the limit? Rather large or rather
> small?
>
> Greetings
> -Sascha-
>
> [1]
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Stats back to Calamari

2016-02-04 Thread Daniel Rolfe
Hi John, thanks for the help. It was related to the calamari branch of
diamond not working with the latest version of Ceph.

I've given you credit on the github issue also

https://github.com/ceph/calamari/issues/384

On Mon, Feb 1, 2016 at 11:22 PM, John Spray  wrote:

> The "assert path[-1] == 'type'" is the error you get when using the
> calamari diamond branch with a >= infernalis version of Ceph (where
> new fields were added to the perf schema output).  No idea if anyone
> has worked on updating Calamari+Diamond for latest ceph.
>
> John
>
> On Mon, Feb 1, 2016 at 12:09 PM, Daniel Rolfe 
> wrote:
> > I can see the .asok files are there
> >
> > root@ceph1:/var/run/ceph# ls -la
> > total 0
> > drwxrwx---  2 ceph ceph  80 Feb  1 10:51 .
> > drwxr-xr-x 18 root root 640 Feb  1 10:52 ..
> > srwxr-xr-x  1 ceph ceph   0 Feb  1 10:51 ceph-mon.ceph1.asok
> > srwxr-xr-x  1 root root   0 Jan 27 15:08 ceph-osd.0.asok
> > root@ceph1:/var/run/ceph#
> > root@ceph1:/var/run/ceph#
> > root@ceph1:/var/run/ceph#
> >
> >
> > Running diamond in debug show the below
> >
> > [2016-02-01 10:55:23,774] [Thread-1] Collecting data from:
> NetworkCollector
> > [2016-02-01 10:56:23,484] [Thread-1] Collecting data from: CPUCollector
> > [2016-02-01 10:56:23,487] [Thread-6] Collecting data from:
> MemoryCollector
> > [2016-02-01 10:56:23,489] [Thread-7] Collecting data from:
> SockstatCollector
> > [2016-02-01 10:56:23,768] [Thread-1] Collecting data from: CephCollector
> > [2016-02-01 10:56:23,768] [Thread-1] gathering service stats for
> > /var/run/ceph/ceph-mon.ceph1.asok
> > [2016-02-01 10:56:24,094] [Thread-1] Traceback (most recent call last):
> >   File "/usr/lib/pymodules/python2.7/diamond/collector.py", line 412, in
> > _run
> > self.collect()
> >   File "/usr/share/diamond/collectors/ceph/ceph.py", line 464, in collect
> > self._collect_service_stats(path)
> >   File "/usr/share/diamond/collectors/ceph/ceph.py", line 450, in
> > _collect_service_stats
> > self._publish_stats(counter_prefix, stats, schema, GlobalName)
> >   File "/usr/share/diamond/collectors/ceph/ceph.py", line 305, in
> > _publish_stats
> > assert path[-1] == 'type'
> > AssertionError
> >
> > [2016-02-01 10:56:24,096] [Thread-8] Collecting data from:
> > LoadAverageCollector
> > [2016-02-01 10:56:24,098] [Thread-1] Collecting data from:
> VMStatCollector
> > [2016-02-01 10:56:24,099] [Thread-1] Collecting data from:
> > DiskUsageCollector
> > [2016-02-01 10:56:24,104] [Thread-9] Collecting data from:
> > DiskSpaceCollector
> >
> >
> >
> > Check the md5 on the file returns the below:
> >
> > root@ceph1:/var/run/ceph# md5sum
> /usr/share/diamond/collectors/ceph/ceph.py
> > aeb3915f8ac7fdea61495805d2c99f33
> /usr/share/diamond/collectors/ceph/ceph.py
> > root@ceph1:/var/run/ceph#
> >
> >
> >
> > I've found that replacing the ceph.py file with the below stops the
> diamond
> > error
> >
> >
> >
> https://raw.githubusercontent.com/BrightcoveOS/Diamond/master/src/collectors/ceph/ceph.py
> >
> > root@ceph1:/usr/share/diamond/collectors/ceph# md5sum ceph.py
> > 13ac74ce0df39a5def879cb5fc530015  ceph.py
> >
> >
> > [2016-02-01 11:14:33,116] [Thread-42] Collecting data from:
> MemoryCollector
> > [2016-02-01 11:14:33,117] [Thread-1] Collecting data from: CPUCollector
> > [2016-02-01 11:14:33,123] [Thread-43] Collecting data from:
> > SockstatCollector
> > [2016-02-01 11:14:35,453] [Thread-1] Collecting data from: CephCollector
> > [2016-02-01 11:14:35,454] [Thread-1] checking
> > /var/run/ceph/ceph-mon.ceph1.asok
> > [2016-02-01 11:14:35,552] [Thread-1] checking
> /var/run/ceph/ceph-osd.0.asok
> > [2016-02-01 11:14:35,685] [Thread-44] Collecting data from:
> > LoadAverageCollector
> > [2016-02-01 11:14:35,686] [Thread-1] Collecting data from:
> VMStatCollector
> > [2016-02-01 11:14:35,687] [Thread-1] Collecting data from:
> > DiskUsageCollector
> > [2016-02-01 11:14:35,692] [Thread-45] Collecting data from:
> > DiskSpaceCollector
> >
> >
> > But after all that it's still NOT working
> >
> > What diamond version are you running ?
> >
> > I'm running Diamond version 3.4.67
> >

Re: [ceph-users] Optimal OSD count for SSDs / NVMe disks

2016-02-04 Thread Sascha Vogt
Hi,

Am 04.02.2016 um 12:59 schrieb Wade Holler:
> You referenced parallel writes for journal and data. Which is default
> for btrfs but not XFS. Now you are mentioning multiple parallel writes
> to the drive , which of course yes will occur.
Ah, that is good to know. So if I want to create more "parallelism" I
should use btrfs then. Thanks a lot, that's a very critical bit of
information :)

> Also Our Dell 400 Gb NVMe drives do not top out around 5-7 sequential
> writes as you mentioned. That would be 5-7 random writes from a drives
> perspective and the NVMe drives can do many times that.
Hm, I used the following fio bench from [1]:
fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test

Our disks showed the following bandwidths (# is the numjobs
parameter):

#1: write: io=1992.2MB, bw=33997KB/s, iops=8499
#2: write: io=5621.6MB, bw=95940KB/s, iops=23984
#3: write: io=8062.8MB, bw=137602KB/s, iops=34400
#4: write: io=9114.1MB, bw=155545KB/s, iops=38886
#5: write: io=8860.7MB, bw=151169KB/s, iops=37792

Also, for more jobs (I tried up to 8) bandwidth stayed at around 150MB/s
and around 37k iops, so I figured that around 5 should be the sweet spot
in terms of journals on a single disk.
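
In other words, roughly this sweep (the device path is an assumption, and the
test writes directly to the raw device, so only run it on an empty disk):

for jobs in 1 2 3 4 5 6 7 8; do
  fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=$jobs --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test
done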

> I would park it at 5-6 partitions per NVMe , journal on the same disk.
> Frequently I want more concurrent operations , rather than all out
> throughput.
For the journal on the same disk, should I limit the journal size? If
yes, what should the limit be? Rather large or rather
small?

Greetings
-Sascha-

[1]http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimal OSD count for SSDs / NVMe disks

2016-02-04 Thread Sascha Vogt
Hi Robert,

Am 04.02.2016 um 00:45 schrieb Robert LeBlanc:
> Once we put in our cache tier the I/O on the spindles was so low, we
> just moved the journals off the SSDs onto the spindles and left the
> SSD space for cache. There have been testing showing that better
> performance can be achieved by putting more OSDs on an NVMe disk, but
> you also have to balance that with OSDs not being evenly distributed
> so some OSDs will use more space than others.

Hm, maybe it was due to the very small size of our cache (only 540GB in
total, limited to 220 GB via max-bytes because of size=2 and the uneven
distribution you mentioned), but we found that during times when the cache
pool flushed to the storage pool, client IO took a severe hit.
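
For reference, the cap and flush thresholds in question are just pool
settings (a sketch; the pool name and values are assumptions):

ceph osd pool set nvme-cache target_max_bytes 236223201280   # ~220 GB cap on the cache pool
ceph osd pool set nvme-cache cache_target_dirty_ratio 0.4    # start flushing dirty objects earlier
ceph osd pool set nvme-cache cache_target_full_ratio 0.8     # start evicting before the pool fills up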

> I probably wouldn't go more than 4 100 GB partitions, but it really
> depends on the number of PGs and your data distribution. Also, even
> with all the data in the cache, there is still a performance penalty
> for having the caching tier vs. a native SSD pool. So if you are not
> using the tiering, move to a straight SSD pool.
Yes, I also have the feeling that less than 100 GB per OSD doesn't make
sense. I tend towards 3 OSDs with about 120GB each, plus a bit for the
journals, as the first "draft" implementation.

Greetings
-Sascha-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can not umount ceph osd partition

2016-02-04 Thread Max A. Krasilnikov
Hello!

On Thu, Feb 04, 2016 at 11:10:06AM +0100, yoann.moulin wrote:

> Hello,

 I am using 0.94.5. When I try to umount the partition and fsck it, I have an issue:
 root@storage003:~# stop ceph-osd id=13
 ceph-osd stop/waiting
 root@storage003:~# umount /var/lib/ceph/osd/ceph-13
 root@storage003:~# fsck -yf /dev/sdf
 fsck from util-linux 2.20.1
 e2fsck 1.42.9 (4-Feb-2014)
 /dev/sdf is in use.
 e2fsck: Cannot continue, aborting.

 There is no /var/lib/ceph/osd/ceph-13 in /proc/mounts, but no ability to check
 the fs.
 I can mount -o remount,rw, but I would like to umount the device for
 maintenance
 and, maybe, replace it.

 Why can't I umount?
>> 
>>> is "lsof -n | grep /dev/sdf" give something ?
>> 
>> Nothing.
>> 
>>> and are you sure /dev/sdf is the disk for osd 13 ?
>> 
>> Absolutely. I have even tried fsck -yf /dev/disk/by-label/osd-13. No luck.
>> 
>> Disk is mounted using LABEL in fstab, journal is symlink to
>> /dev/disk/by-partlabel/j-13.

> I think it's more linux related.

Maybe. But I have it only on ceph boxes :(

> could you try to look with lsof whether something holds the device by the
> label or uuid instead of /dev/sdf?

> you can try to delete the device from the SCSI bus with something like:

> echo 1 > /sys/block/<device>/device/delete

> be careful, it is like removing the disk physically; if a process holds
> the device, you can expect that process to switch into kernel
> status "D+". You won't be able to kill that process even with kill -9. To
> stop it, you will have to reboot the server.

> you can have a look here at how to manipulate the SCSI bus:

> http://fibrevillage.com/storage/279-hot-add-remove-rescan-of-scsi-devices-on-linux

> you can install the package "scsitools", which provides rescan-scsi-bus.sh,
> to rescan your SCSI bus and get back the disk you removed.

> http://manpages.ubuntu.com/manpages/precise/man8/rescan-scsi-bus.8.html

> hope that can help you

Thanks a lot! I will try to use partx -u (it has sometimes helped me in the past to
re-read partitions from a disk when gdisk was not able to update the kernel's list of
partitions) and software removal/re-insertion of the drive.
If some process falls into uninterruptible sleep, I will reboot the node. It will
be rebooted in any case if this does not help.

If I find out something, I will post it here. I think it can affect other
ceph users.

-- 
WBR, Max A. Krasilnikov
ColoCall Data Center
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimal OSD count for SSDs / NVMe disks

2016-02-04 Thread Wade Holler
You referenced parallel writes for journal and data, which is the default for
btrfs but not XFS. Now you are mentioning multiple parallel writes to the
drive, which of course will occur.

Also, our Dell 400 GB NVMe drives do not top out at around 5-7 sequential
writes as you mentioned. That would be 5-7 random writes from the drive's
perspective, and the NVMe drives can do many times that.

I would park it at 5-6 partitions per NVMe, journal on the same disk.
Frequently I want more concurrent operations rather than all-out
throughput.
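
A minimal sketch of carving one NVMe up that way with sgdisk (device, sizes
and labels are assumptions, not a recommendation):

sgdisk -n 1:0:+10G -c 1:"journal-0" /dev/nvme0n1   # journal partition for the first OSD
sgdisk -n 2:0:+60G -c 2:"osd-0"     /dev/nvme0n1   # data partition for the first OSD
# repeat with increasing partition numbers for osd-1 .. osd-4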
On Thu, Feb 4, 2016 at 6:49 AM Sascha Vogt  wrote:

> Am 03.02.2016 um 17:24 schrieb Wade Holler:
> > AFAIK when using XFS, parallel write as you described is not enabled.
> Not sure I'm getting this. If I have multiple OSDs on the same NVMe
> (separated by different data-partitions) I have multiple parallel writes
> (one "stream" per OSD), or am I mistaken?
>
> > Regardless in a way though the NVMe drives are so fast it shouldn't
> > matter much the partitioned journal or other choice.
> Thanks, does anyone have benchmarks on this? How about the size of the
> journal?
>
> > What I would be more interested in is you replication size on the cache
> > pool.
> >
> > This might sound crazy but if your KVM instances are really that short
> > lived, could you get away with size=2 on the cache pool from and
> > availability perspective ?
> :) We are already on min_size=1, size=2 - we even ran for a while with
> min_size=1, size=1, so we cannot squeeze out much more on that end.
>
> Greetings
> -Sascha-
>
> PS: Thanks a lot already for all the answers!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimal OSD count for SSDs / NVMe disks

2016-02-04 Thread Sascha Vogt
Am 03.02.2016 um 17:24 schrieb Wade Holler:
> AFAIK when using XFS, parallel write as you described is not enabled.
Not sure I'm getting this. If I have multiple OSDs on the same NVMe
(separated by different data-partitions) I have multiple parallel writes
(one "stream" per OSD), or am I mistaken?

> Regardless in a way though the NVMe drives are so fast it shouldn't
> matter much the partitioned journal or other choice.
Thanks. Does anyone have benchmarks on this? How about the size of the
journal?

> What I would be more interested in is you replication size on the cache
> pool.
> 
> This might sound crazy but if your KVM instances are really that short
> lived, could you get away with size=2 on the cache pool from and
> availability perspective ?
:) We are already on min_size=1, size=2 - we even ran for a while with
min_size=1, size=1, so we cannot squeeze out much more on that end.
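
For reference, those are just pool settings (a sketch with a hypothetical pool
name):

ceph osd pool set nvme-cache size 2       # number of replicas
ceph osd pool set nvme-cache min_size 1   # replicas required to accept IO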

Greetings
-Sascha-

PS: Thanks a lot already for all the answers!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can not umount ceph osd partition

2016-02-04 Thread Yoann Moulin
Hello,

>>> I am using 0.94.5. When I try to umount the partition and fsck it, I have an issue:
>>> root@storage003:~# stop ceph-osd id=13
>>> ceph-osd stop/waiting
>>> root@storage003:~# umount /var/lib/ceph/osd/ceph-13
>>> root@storage003:~# fsck -yf /dev/sdf
>>> fsck from util-linux 2.20.1
>>> e2fsck 1.42.9 (4-Feb-2014)
>>> /dev/sdf is in use.
>>> e2fsck: Cannot continue, aborting.
>>>
>>> There is no /var/lib/ceph/osd/ceph-13 in /proc/mounts, but no ability to check
>>> the fs.
>>> I can mount -o remount,rw, but I would like to umount the device for maintenance
>>> and, maybe, replace it.
>>>
>>> Why can't I umount?
> 
>> is "lsof -n | grep /dev/sdf" give something ?
> 
> Nothing.
> 
>> and are you sure /dev/sdf is the disk for osd 13 ?
> 
> Absolutely. I have even tried fsck -yf /dev/disk/by-label/osd-13. No luck.
> 
> Disk is mounted using LABEL in fstab, journal is symlink to
> /dev/disk/by-partlabel/j-13.

I think it's more linux related.

could you try to look with lsof whether something holds the device by the
label or uuid instead of /dev/sdf?
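
For example (a sketch; the label and journal names are the ones from your
description):

lsof -n | grep -E '/dev/sdf|osd-13|j-13'   # anything still holding the device, label or journal?
fuser -v /dev/sdf                          # kernel-level users of the block device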

you can try to delete the device from the SCSI bus with something like:

echo 1 > /sys/block/<device>/device/delete

be careful, it is like removing the disk physically; if a process holds
the device, you can expect that process to switch into kernel
status "D+". You won't be able to kill that process even with kill -9. To
stop it, you will have to reboot the server.

you can have a look here at how to manipulate the SCSI bus:

http://fibrevillage.com/storage/279-hot-add-remove-rescan-of-scsi-devices-on-linux

you can install the package "scsitools", which provides rescan-scsi-bus.sh,
to rescan your SCSI bus and get back the disk you removed.

http://manpages.ubuntu.com/manpages/precise/man8/rescan-scsi-bus.8.html

hope that can help you

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph 9.2.0 mds cluster went down and now constantly crashes with Floating point exception

2016-02-04 Thread Kenneth Waegeman

Hi,

Hi, we are running ceph 9.2.0.
Overnight, our ceph state went to 'mds mds03 is laggy'. When I checked 
the logs, I saw this mds had crashed with a stacktrace. I checked the other 
MDSs, and I saw the same there.
When I try to start the mds again, I again get a stacktrace and it won't 
come up:


 -12> 2016-02-04 10:23:46.837131 7ff9ea570700  1 -- 
10.141.16.2:6800/193767 <== osd.146 10.141.16.25:6800/7036 1  
osd_op_reply(207 15ef982. [stat] v0'0 uv22184 ondisk = 0) v6 
 187+0+16 (113

2261152 0 506978568) 0x7ffa171ae940 con 0x7ffa189cc3c0
-11> 2016-02-04 10:23:46.837317 7ff9ed6a1700  1 -- 
10.141.16.2:6800/193767 <== osd.136 10.141.16.24:6800/6764 6  
osd_op_reply(209 148aaac. [delete] v0'0 uv23797 ondisk = -2 
((2) No such file o
r directory)) v6  187+0+0 (64699207 0 0) 0x7ffa171acb00 con 
0x7ffa014fd9c0
-10> 2016-02-04 10:23:46.837406 7ff9ec994700  1 -- 
10.141.16.2:6800/193767 <== osd.36 10.141.16.14:6800/5395 5  
osd_op_reply(175 15f631f. [stat] v0'0 uv22466 ondisk = 0) v6 
 187+0+16 (1037

61047 0 2527067705) 0x7ffa08363700 con 0x7ffa189ca580
 -9> 2016-02-04 10:23:46.837463 7ff9eba85700  1 -- 
10.141.16.2:6800/193767 <== osd.47 10.141.16.15:6802/7128 2  
osd_op_reply(211 148aac8. [delete] v0'0 uv22990 ondisk = -2 
((2) No such file or
  directory)) v6  187+0+0 (1138385695 0 0) 0x7ffa01cd0dc0 con 
0x7ffa189cadc0
 -8> 2016-02-04 10:23:46.837468 7ff9eb27d700  1 -- 
10.141.16.2:6800/193767 <== osd.16 10.141.16.12:6800/5739 2  
osd_op_reply(212 148aacd. [delete] v0'0 uv23991 ondisk = -2 
((2) No such file or
  directory)) v6  187+0+0 (1675093742 0 0) 0x7ffa171ac840 con 
0x7ffa189cb760
 -7> 2016-02-04 10:23:46.837477 7ff9eab76700  1 -- 
10.141.16.2:6800/193767 <== osd.66 10.141.16.17:6800/6353 2  
osd_op_reply(210 148aab9. [delete] v0'0 uv24583 ondisk = -2 
((2) No such file or
  directory)) v6  187+0+0 (603192739 0 0) 0x7ffa19054680 con 
0x7ffa189cbce0
 -6> 2016-02-04 10:23:46.838140 7ff9f0bcf700  1 -- 
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 43  
osd_op_reply(121 200.9d96 [write 1459360~980] v943'4092 uv4092 
ondisk = 0) v6  179+0+0 (3939130488 0 0) 0x7ffa01590100 con 
0x7ffa014fab00
 -5> 2016-02-04 10:23:46.838342 7ff9f0bcf700  1 -- 
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 44  
osd_op_reply(124 200.9d96 [write 1460340~956] v943'4093 uv4093 
ondisk = 0) v6  179+0+0 (1434265886 0 0) 0x7ffa01590100 con 
0x7ffa014fab00
 -4> 2016-02-04 10:23:46.838531 7ff9f0bcf700  1 -- 
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 45  
osd_op_reply(126 200.9d96 [write 1461296~954] v943'4094 uv4094 
ondisk = 0) v6  179+0+0 (25292940 0 0) 0x7ffa01590100 con 0x7ffa014fab00
 -3> 2016-02-04 10:23:46.838700 7ff9ecd98700  1 -- 
10.141.16.2:6800/193767 <== osd.57 10.141.16.16:6802/7067 3  
osd_op_reply(199 15ef976. [stat] v0'0 uv22557 ondisk = 0) v6 
 187+0+16 (354652996 0 2244692791) 0x7ffa171ade40 con 0x7ffa189ca160
 -2> 2016-02-04 10:23:46.839301 7ff9ed8a3700  1 -- 
10.141.16.2:6800/193767 <== osd.107 10.141.16.21:6802/7468 3  
osd_op_reply(115 1625476. [stat] v0'0 uv22587 ondisk = 0) v6 
 187+0+16 (664308076 0 998461731) 0x7ffa08363c80 con 0x7ffa014fdb20
 -1> 2016-02-04 10:23:46.839322 7ff9f0bcf700  1 -- 
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 46  
osd_op_reply(128 200.9d96 [write 1462250~954] v943'4095 uv4095 
ondisk = 0) v6  179+0+0 (1379768629 0 0) 0x7ffa01590100 con 
0x7ffa014fab00
  0> 2016-02-04 10:23:46.839379 7ff9f30d8700 -1 *** Caught signal 
(Floating point exception) **

  in thread 7ff9f30d8700

  ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
  1: (()+0x4b6fa2) [0x7ff9fd091fa2]
  2: (()+0xf100) [0x7ff9fbfd3100]
  3: (StrayManager::_calculate_ops_required(CInode*, bool)+0xa2) 
[0x7ff9fcf0adc2]

  4: (StrayManager::enqueue(CDentry*, bool)+0x169) [0x7ff9fcf10459]
  5: (StrayManager::__eval_stray(CDentry*, bool)+0xa49) [0x7ff9fcf111c9]
  6: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7ff9fcf113ce]
  7: (MDCache::scan_stray_dir(dirfrag_t)+0x13d) [0x7ff9fce6741d]
  8: (MDSInternalContextBase::complete(int)+0x1e3) [0x7ff9fcff4993]
  9: (MDSRank::_advance_queues()+0x382) [0x7ff9fcdd4652]
  10: (MDSRank::ProgressThread::entry()+0x4a) [0x7ff9fcdd4aca]
  11: (()+0x7dc5) [0x7ff9fbfcbdc5]
  12: (clone()+0x6d) [0x7ff9faeb621d]

Does someone have an idea? We can't use our fs right now...

I included the full log of an mds start as an attachment.

Thanks!!

K



mds02.tar.bz2
Description: application/bzip
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds0: Client X failing to respond to capability release

2016-02-04 Thread Michael Metz-Martini | SpeedPartner GmbH
Hi,

Am 04.02.2016 um 09:43 schrieb Yan, Zheng:
> On Thu, Feb 4, 2016 at 4:36 PM, Michael Metz-Martini | SpeedPartner
> GmbH  wrote:
>> Am 03.02.2016 um 15:55 schrieb Yan, Zheng:
 On Feb 3, 2016, at 21:50, Michael Metz-Martini | SpeedPartner GmbH 
  wrote:
 Am 03.02.2016 um 12:11 schrieb Yan, Zheng:
>> On Feb 3, 2016, at 17:39, Michael Metz-Martini | SpeedPartner GmbH 
>>  wrote:
>> Am 03.02.2016 um 10:26 schrieb Gregory Farnum:
>>> On Tue, Feb 2, 2016 at 10:09 PM, Michael Metz-Martini | SpeedPartner
 2016-02-03 14:42:25.581840 7fadfd280700  0 log_channel(default) log
 [WRN] : 7 slow requests, 6 included below; oldest blocked for >
 62.125785 secs
 2016-02-03 14:42:25.581849 7fadfd280700  0 log_channel(default) log
 [WRN] : slow request 62.125785 seconds old, received at 2016-02-03
 14:41:23.455812: client_request(client.10199855:1313157 getattr
 pAsLsXsFs #100815bd349 2016-02-03 14:41:23.452386) currently failed to
 rdlock, waiting
>>>
>>> This seems like dirty page writeback is too slow.  Is there any hung OSD 
>>> request in /sys/kernel/debug/ceph/xxx/osdc?
>> Where should I check? Client or mds? Do I have to enable something to
>> get this details? Directory /sys/kernel/debug/ceph/ seems to be missing.
> On client with kernel ceph mount. If there is no debugfs, mount
> debugfs first (mount -t debugfs /sys/kernel/debug /sys/kernel/debug)
Got it. http://www.michael-metz.de/osdc.txt.gz (about 500kb uncompressed)

By looking around I found caps .. $ cat caps
total   305975
avail   2
used305973
reserved0
min 1024

Somehow related? avail=2 is low ;-)

-- 
Kind regards
 Michael Metz-Martini
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds0: Client X failing to respond to capability release

2016-02-04 Thread Yan, Zheng
On Thu, Feb 4, 2016 at 4:36 PM, Michael Metz-Martini | SpeedPartner
GmbH  wrote:
> Hi,
>
> Am 03.02.2016 um 15:55 schrieb Yan, Zheng:
>>> On Feb 3, 2016, at 21:50, Michael Metz-Martini | SpeedPartner GmbH 
>>>  wrote:
>>> Am 03.02.2016 um 12:11 schrieb Yan, Zheng:
> On Feb 3, 2016, at 17:39, Michael Metz-Martini | SpeedPartner GmbH 
>  wrote:
> Am 03.02.2016 um 10:26 schrieb Gregory Farnum:
>> On Tue, Feb 2, 2016 at 10:09 PM, Michael Metz-Martini | SpeedPartner
>>> 2016-02-03 14:42:25.581840 7fadfd280700  0 log_channel(default) log
>>> [WRN] : 7 slow requests, 6 included below; oldest blocked for >
>>> 62.125785 secs
>>> 2016-02-03 14:42:25.581849 7fadfd280700  0 log_channel(default) log
>>> [WRN] : slow request 62.125785 seconds old, received at 2016-02-03
>>> 14:41:23.455812: client_request(client.10199855:1313157 getattr
>>> pAsLsXsFs #100815bd349 2016-02-03 14:41:23.452386) currently failed to
>>> rdlock, waiting
>>
>> This seems like dirty page writeback is too slow.  Is there any hung OSD 
>> request in /sys/kernel/debug/ceph/xxx/osdc?
> Where should I check? Client or mds? Do I have to enable something to
> get this details? Directory /sys/kernel/debug/ceph/ seems to be missing.
>

On client with kernel ceph mount. If there is no debugfs, mount
debugfs first (mount -t debugfs /sys/kernel/debug /sys/kernel/debug)

Regards
Yan, Zheng

>
> --
> Kind regards
>  Michael Metz-Martini
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds0: Client X failing to respond to capability release

2016-02-04 Thread Michael Metz-Martini | SpeedPartner GmbH
Hi,

Am 03.02.2016 um 15:55 schrieb Yan, Zheng:
>> On Feb 3, 2016, at 21:50, Michael Metz-Martini | SpeedPartner GmbH 
>>  wrote:
>> Am 03.02.2016 um 12:11 schrieb Yan, Zheng:
 On Feb 3, 2016, at 17:39, Michael Metz-Martini | SpeedPartner GmbH 
  wrote:
 Am 03.02.2016 um 10:26 schrieb Gregory Farnum:
> On Tue, Feb 2, 2016 at 10:09 PM, Michael Metz-Martini | SpeedPartner
>> 2016-02-03 14:42:25.581840 7fadfd280700  0 log_channel(default) log
>> [WRN] : 7 slow requests, 6 included below; oldest blocked for >
>> 62.125785 secs
>> 2016-02-03 14:42:25.581849 7fadfd280700  0 log_channel(default) log
>> [WRN] : slow request 62.125785 seconds old, received at 2016-02-03
>> 14:41:23.455812: client_request(client.10199855:1313157 getattr
>> pAsLsXsFs #100815bd349 2016-02-03 14:41:23.452386) currently failed to
>> rdlock, waiting
> 
> This seems like dirty page writeback is too slow.  Is there any hung OSD 
> request in /sys/kernel/debug/ceph/xxx/osdc?
Where should I check? Client or mds? Do I have to enable something to
get this details? Directory /sys/kernel/debug/ceph/ seems to be missing.


-- 
Kind regards
 Michael Metz-Martini
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hammer-0.94.5 + kernel-4.1.15 - cephfs stuck

2016-02-04 Thread Nikola Ciprich


On 4 February 2016 08:33:55 CET, Gregory Farnum  wrote:
>The quick and dirty cleanup is to restart the OSDs hosting those PGs.
>They might have gotten some stuck ops which didn't get woken up; a few
>bugs like that have gone by and are resolved in various stable
>branches (I'm not sure what release binaries they're in).
>
That's what I thought, so I tried restarting all OSDs already, but those stuck 
PGs still remain.
The version I'm running is 0.94.5.
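
For anyone following along, the usual way to poke at these looks roughly like
this (a sketch; the PG and OSD ids are the ones quoted below):

ceph pg dump_stuck stale     # lists the PGs stuck in the stale state
ceph pg 6.11 query           # just hangs for these PGs, as mentioned below
ceph osd find 4              # confirm the acting OSDs are up and where they live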

Nik


>On Wed, Feb 3, 2016 at 11:32 PM, Nikola Ciprich
> wrote:
>>> Yeah, these inactive PGs are basically guaranteed to be the cause of
>>> the problem. There are lots of threads about getting PGs healthy
>>> again; you should dig around the archives and the documentation
>>> troubleshooting page(s). :)
>>> -Greg
>>
>> Hello Gregory,
>>
>> well, I wouldn't doubt it, but when the problems started, the only
>> unclean PGs were some remapped ones, none inactive, so I guess it must've
>> been something else..
>>
>> but I'm now struggling to get rid of those inactive ones of course..
>> however I've not been successful so far; I've probably read all
>> the related docs and discussions and still haven't found a similar
>> problem..
>>
>> pg 6.11 is stuck stale for 79285.647847, current state
>> stale+active+clean, last acting [4,10,8]
>> pg 3.198 is stuck stale for 79367.532437, current state
>> stale+active+clean, last acting [8,13]
>>
>> those two are stale for some reason.. but OSDs 4, 8, 10, 13 are running, there
>> are no network problems.. PG query on those just hangs..
>>
>> I'm running out of ideas here..
>>
>> nik
>>
>>
>> --
>> -
>> Ing. Nikola CIPRICH
>> LinuxBox.cz, s.r.o.
>> 28. rijna 168, 709 00 Ostrava
>>
>> tel.:   +420 591 166 214
>> fax:+420 596 621 273
>> mobil:  +420 777 093 799
>>
>> www.linuxbox.cz
>>
>> mobil servis: +420 737 238 656
>> email servis: ser...@linuxbox.cz
>> -

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com