Re: [ceph-users] blocked ops

2016-08-11 Thread Brad Hubbard
On Thu, Aug 11, 2016 at 11:33:29PM +0100, Roeland Mertens wrote:
> Hi,
> 
> I was hoping someone on this list may be able to help?
> 
> We're running a 35 node 10.2.1 cluster with 595 OSDs. For the last 12 hours
> we've been plagued with blocked requests which completely kills the
> performance of the cluster
> 
> # ceph health detail
> HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 1
> pgs peering; 1 pgs stuck inactive; 100 requests are blocked > 32 sec; 1 osds
> have slow requests; noout,nodeep-scrub,sortbitwise flag(s) set
> pg 63.1a18 is stuck inactive for 135133.509820, current state
> down+remapped+peering, last acting 
> [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]

That value (2147483647) is defined in src/crush/crush.h like so;

#define CRUSH_ITEM_NONE   0x7fffffff  /* no result */

So this could be due to a bad crush rule or maybe choose_total_tries needs to
be higher?

$ ceph osd crush rule ls

Then, for each rule listed by the above command:

$ ceph osd crush rule dump [rule_name]

I'd then dump out the crushmap and test it for bad mappings with the
commands listed here;

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
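
Roughly something like the following -- the rule id and num-rep are examples,
13 on the assumption that's the k+m of your EC pool:

$ ceph osd getcrushmap -o /tmp/crush.bin
$ crushtool -i /tmp/crush.bin --test --show-bad-mappings --rule 1 --num-rep 13 --min-x 0 --max-x 10000

If that reports bad mappings, decompile the map, bump choose_total_tries
(usually 50) and re-test before injecting it back:

$ crushtool -d /tmp/crush.bin -o /tmp/crush.txt
# edit "tunable choose_total_tries 50" to e.g. 100 in /tmp/crush.txt
$ crushtool -c /tmp/crush.txt -o /tmp/crush.new
$ crushtool -i /tmp/crush.new --test --show-bad-mappings --rule 1 --num-rep 13 --min-x 0 --max-x 10000
$ ceph osd setcrushmap -i /tmp/crush.new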

I'd also check that the pg numbers for your pool(s) are appropriate, as too few
pgs could also be a contributing factor IIRC.
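
For example (pool name is a placeholder):

$ ceph osd pool get <poolname> pg_num
$ ceph osd pool get <poolname> pgp_num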

That should hopefully give some insight.

-- 
HTH,
Brad

> pg 63.1a18 is down+remapped+peering, acting 
> [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
> 100 ops are blocked > 2097.15 sec on osd.4
> 1 osds have slow requests
> noout,nodeep-scrub,sortbitwise flag(s) set
> 
> the one pg down is due to us running into an odd EC issue which I mailed the
> list about earlier; it's the 100 blocked ops that are puzzling us. If we out
> the osd in question, they just shift to another osd (on a different host!).
> We even tried rebooting the node it's on but to little avail.
> 
> We get a ton of log messages like this:
> 
> 2016-08-11 23:32:10.041174 7fc668d9f700  0 log_channel(cluster) log [WRN] :
> 100 slow requests, 5 included below; oldest blocked for > 139.313915 secs
> 2016-08-11 23:32:10.041184 7fc668d9f700  0 log_channel(cluster) log [WRN] :
> slow request 139.267004 seconds old, received at 2016-08-11 23:29:50.774091:
> osd_op(client.9192464.0:485640 66.b96c3a18
> default.4282484.42_442fac8195c63a2e19c3c4bb91e8800e [getxattrs,stat,read
> 0~524288] snapc 0=[] RETRY=36 ack+retry+read+known_if_redirected e50109)
> currently waiting for blocked object
> 2016-08-11 23:32:10.041189 7fc668d9f700  0 log_channel(cluster) log [WRN] :
> slow request 139.244839 seconds old, received at 2016-08-11 23:29:50.796256:
> osd_op(client.9192464.0:596033 66.942a5a18
> default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [write
> 1048576~524288] snapc 0=[] RETRY=36
> ack+ondisk+retry+write+known_if_redirected e50109) currently waiting for
> blocked object
> 
> A dump of the blocked ops tells us very little; is there anyone who can
> shed some light on this? Or at least give us a hint on how we can fix this?
> 
> # ceph daemon osd.4 dump_blocked_ops
> 
> 
>{
> "description": "osd_op(client.9192464.0:596030 66.942a5a18
> default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [writefull
> 0~0] snapc 0=[] RETRY=32 ack+ondisk+retry+write+known_if_redirected
> e50092)",
> "initiated_at": "2016-08-11 22:58:09.721027",
> "age": 1515.105186,
> "duration": 1515.113255,
> "type_data": [
> "reached pg",
> {
> "client": "client.9192464",
> "tid": 596030
> },
> [
> {
> "time": "2016-08-11 22:58:09.721027",
> "event": "initiated"
> },
> {
> "time": "2016-08-11 22:58:09.721066",
> "event": "waiting_for_map not empty"
> },
> {
> "time": "2016-08-11 22:58:09.813574",
> "event": "reached_pg"
> },
> {
> "time": "2016-08-11 22:58:09.813581",
> "event": "waiting for peered"
> },
> {
> "time": "2016-08-11 22:58:09.852796",
> "event": "reached_pg"
> },
> {
> "time": "2016-08-11 22:58:09.852804",
> "event": "waiting for peered"
> },
> {
> "time": "2016-08-11 22:58:10.876636",
> "event": "reached_pg"
> },
> {
>  

[ceph-users] blocked ops

2016-08-11 Thread Roeland Mertens

Hi,

I was hoping someone on this list may be able to help?

We're running a 35 node 10.2.1 cluster with 595 OSDs. For the last 12 
hours we've been plagued with blocked requests which completely kills 
the performance of the cluster


# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs 
down; 1 pgs peering; 1 pgs stuck inactive; 100 requests are blocked > 32 
sec; 1 osds have slow requests; noout,nodeep-scrub,sortbitwise flag(s) set
pg 63.1a18 is stuck inactive for 135133.509820, current state 
down+remapped+peering, last acting 
[2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
pg 63.1a18 is down+remapped+peering, acting 
[2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]

100 ops are blocked > 2097.15 sec on osd.4
1 osds have slow requests
noout,nodeep-scrub,sortbitwise flag(s) set

the one pg down is due to us running into an odd EC issue which I mailed 
the list about earlier; it's the 100 blocked ops that are puzzling us. 
If we out the osd in question, they just shift to another osd (on a 
different host!). We even tried rebooting the node it's on but to little 
avail.


We get a ton of log messages like this:

2016-08-11 23:32:10.041174 7fc668d9f700  0 log_channel(cluster) log 
[WRN] : 100 slow requests, 5 included below; oldest blocked for > 
139.313915 secs
2016-08-11 23:32:10.041184 7fc668d9f700  0 log_channel(cluster) log 
[WRN] : slow request 139.267004 seconds old, received at 2016-08-11 
23:29:50.774091: osd_op(client.9192464.0:485640 66.b96c3a18 
default.4282484.42_442fac8195c63a2e19c3c4bb91e8800e [getxattrs,stat,read 
0~524288] snapc 0=[] RETRY=36 ack+retry+read+known_if_redirected e50109) 
currently waiting for blocked object
2016-08-11 23:32:10.041189 7fc668d9f700  0 log_channel(cluster) log 
[WRN] : slow request 139.244839 seconds old, received at 2016-08-11 
23:29:50.796256: osd_op(client.9192464.0:596033 66.942a5a18 
default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [write 
1048576~524288] snapc 0=[] RETRY=36 
ack+ondisk+retry+write+known_if_redirected e50109) currently waiting for 
blocked object


A dump of the blocked ops tells us very little; is there anyone who can 
shed some light on this? Or at least give us a hint on how we can fix this?


# ceph daemon osd.4 dump_blocked_ops


   {
"description": "osd_op(client.9192464.0:596030 66.942a5a18 
default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [writefull 
0~0] snapc 0=[] RETRY=32 ack+ondisk+retry+write+known_if_redirected 
e50092)",

"initiated_at": "2016-08-11 22:58:09.721027",
"age": 1515.105186,
"duration": 1515.113255,
"type_data": [
"reached pg",
{
"client": "client.9192464",
"tid": 596030
},
[
{
"time": "2016-08-11 22:58:09.721027",
"event": "initiated"
},
{
"time": "2016-08-11 22:58:09.721066",
"event": "waiting_for_map not empty"
},
{
"time": "2016-08-11 22:58:09.813574",
"event": "reached_pg"
},
{
"time": "2016-08-11 22:58:09.813581",
"event": "waiting for peered"
},
{
"time": "2016-08-11 22:58:09.852796",
"event": "reached_pg"
},
{
"time": "2016-08-11 22:58:09.852804",
"event": "waiting for peered"
},
{
"time": "2016-08-11 22:58:10.876636",
"event": "reached_pg"
},
{
"time": "2016-08-11 22:58:10.876640",
"event": "waiting for peered"
},
{
"time": "2016-08-11 22:58:10.902760",
"event": "reached_pg"
}
]
]
}
...


Kind regards,


Roeland


--
This email is sent on behalf of Genomics plc, a public limited company 
registered in England and Wales with registered number 8839972, VAT 
registered number 189 2635 65 and registered office at King Charles House, 
Park End Street, Oxford, OX1 1JD, United Kingdom.
The contents of this e-mail and any attachments are confidential to the 
intended recipient. If you are not the intended recipient please do not use 
or publish its contents, contact Genomics plc immediately at 
i...@genomicsplc.com  then delete. You may not copy, 

Re: [ceph-users] Backfilling pgs not making progress

2016-08-11 Thread Samuel Just
I just updated the bug with several questions.
-Sam

On Thu, Aug 11, 2016 at 6:56 AM, Brian Felton  wrote:
> Sam,
>
> I very much appreciate the assistance.  I have opened
> http://tracker.ceph.com/issues/16997 to track this (potential) issue.
>
> Brian
>
> On Wed, Aug 10, 2016 at 1:53 PM, Samuel Just  wrote:
>>
>> Ok, can you
>> 1) Open a bug
>> 2) Identify all osds involved in the 5 problem pgs
>> 3) enable debug osd = 20, debug filestore = 20, debug ms = 1 on all of
>> them
>> 4) mark the primary for each pg down (should cause peering and
>> backfill to restart)
>> 5) link all logs to the bug
>>
>> Thanks!
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 9:11 AM, Samuel Just  wrote:
>> > Hmm, nvm, it's not an lfn object anyway.
>> > -Sam
>> >
>> > On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton 
>> > wrote:
>> >> If I search on osd.580, I find
>> >>
>> >> default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon
>> >> picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct.
>> >> 1\sDSC04329.JPG__head_981926C1__21__5, which has a
>> >> non-zero
>> >> size and a hash (981926C1) that matches that of the same file found on
>> >> the
>> >> other OSDs in the pg.
>> >>
>> >> If I'm misunderstanding what you're asking about a dangling link,
>> >> please
>> >> point me in the right direction.
>> >>
>> >> Brian
>> >>
>> >> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just  wrote:
>> >>>
>> >>> Did you also confirm that the backfill target does not have any of
>> >>> those dangling links?  I'd be looking for a dangling link for
>> >>>
>> >>>
>> >>> 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron
>> >>> picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33
>> >>> on osd.580.
>> >>> -Sam
>> >>>
>> >>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton 
>> >>> wrote:
>> >>> > Sam,
>> >>> >
>> >>> > I cranked up the logging on the backfill target (osd 580 on node 07)
>> >>> > and
>> >>> > the
>> >>> > acting primary for the pg (453 on node 08, for what it's worth).
>> >>> > The
>> >>> > logs
>> >>> > from the primary are very large, so pardon the tarballs.
>> >>> >
>> >>> > PG Primary Logs:
>> >>> >
>> >>> > https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B
>> >>> > PG Backfill Target Logs:
>> >>> >
>> >>> > https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>> >>> >
>> >>> > I'll be reviewing them with my team tomorrow morning to see if we
>> >>> > can
>> >>> > find
>> >>> > anything.  Thanks for your assistance.
>> >>> >
>> >>> > Brian
>> >>> >
>> >>> > On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just 
>> >>> > wrote:
>> >>> >>
>> >>> >> The next thing I'd want is for you to reproduce with
>> >>> >>
>> >>> >> debug osd = 20
>> >>> >> debug filestore = 20
>> >>> >> debug ms = 1
>> >>> >>
>> >>> >> and post the file somewhere.
>> >>> >> -Sam
>> >>> >>
>> >>> >> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just 
>> >>> >> wrote:
>> >>> >> > If you don't have the orphaned file link, it's not the same bug.
>> >>> >> > -Sam
>> >>> >> >
>> >>> >> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton
>> >>> >> > 
>> >>> >> > wrote:
>> >>> >> >> Sam,
>> >>> >> >>
>> >>> >> >> I'm reviewing that thread now, but I'm not seeing a lot of
>> >>> >> >> overlap
>> >>> >> >> with
>> >>> >> >> my
>> >>> >> >> cluster's situation.  For one, I am unable to start either a
>> >>> >> >> repair
>> >>> >> >> or
>> >>> >> >> a
>> >>> >> >> deep scrub on any of the affected pgs.  I've instructed all six
>> >>> >> >> of
>> >>> >> >> the
>> >>> >> >> pgs
>> >>> >> >> to scrub, deep-scrub, and repair, and the cluster has been
>> >>> >> >> gleefully
>> >>> >> >> ignoring these requests (it has been several hours since I first
>> >>> >> >> tried,
>> >>> >> >> and
>> >>> >> >> the logs indicate none of the pgs ever scrubbed).  Second, none
>> >>> >> >> of
>> >>> >> >> my
>> >>> >> >> OSDs is crashing.  Third, none of my pgs or objects has ever
>> >>> >> >> been
>> >>> >> >> marked
>> >>> >> >> inconsistent (or unfound, for that matter) -- I'm only seeing
>> >>> >> >> the
>> >>> >> >> standard
>> >>> >> >> mix of degraded/misplaced objects that are common during a
>> >>> >> >> recovery.
>> >>> >> >> What
>> >>> >> >> I'm not seeing is any further progress on the number of
>> >>> >> >> misplaced
>> >>> >> >> objects --
>> >>> >> >> the number has remained effectively unchanged for the past
>> >>> >> >> several
>> >>> >> >> days.
>> >>> >> >>
>> >>> >> >> To be sure, though, I tracked down the file that the backfill
>> >>> >> >> operation
>> >>> >> >> seems to be hung on, and I can find it in both the backfill
>> >>> >> >> target
>> >>> >> >> osd
>> >>> >> >> (580)

Re: [ceph-users] rbd-nbd kernel requirements

2016-08-11 Thread Shawn Edwards
Fair enough.

On Thu, Aug 11, 2016, 10:45 Jason Dillaman  wrote:

> I don't think anyone has really looked into the cause yet, so it's
> hard to say where the problem lies.
>
> On Thu, Aug 11, 2016 at 9:36 AM, Shawn Edwards 
> wrote:
> > Is it thought that this bug is in Ceph and not the kernel?
> >
> >
> > On Thu, Aug 11, 2016 at 8:14 AM Jason Dillaman 
> wrote:
> >>
> >> At this point, we only have automated tests that exercise it against
> >> stock Ubuntu Trusty but that will eventually expand to Xenial once we
> >> get our lab configured for it.  There is one known issue right now
> >> where the kernel can deadlock while starting the test [1] and
> >> mapping/unmapping the device.
> >>
> >> Red Hat Enterprise Linux has the nbd driver disabled so we don't test
> >> against those variants. I've personally run it on Fedora 23 and 24
> >> during manual testing without any apparent issues.
> >>
> >> [1] http://tracker.ceph.com/issues/16921
> >>
> >> On Wed, Aug 10, 2016 at 7:16 PM, Shawn Edwards 
> >> wrote:
> >> > Is there a minimum kernel version required for rbd-nbd to work and
> work
> >> > well?  Before I start stress testing it, I want to be sure I have a
> >> > system
> >> > that is expected to work.
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >>
> >>
> >>
> >> --
> >> Jason
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Wido den Hollander

> On 11 August 2016 at 15:17, Sean Sullivan  wrote:
> 
> 
> Hello Wido,
> 
> Thanks for the advice.  While the data center has a/b circuits and
> redundant power, etc., if a ground fault happens it travels outside and
> fails, causing the whole building to fail (apparently).
> 
> The monitors are each the same with
> 2x e5 cpus
> 64gb of ram
> 4x 300gb 10k SAS drives in raid 10 (write through mode).
> Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 -
> 3am CST)
> Ceph hammer LTS 0.94.7
> 
> (we are still working on our jewel test cluster so it is planned but not in
> place yet)
> 
> The only thing that seems to be corrupt is the monitors leveldb store.  I
> see multiple issues on Google leveldb github from March 2016 about fsync
> and power failure so I assume this is an issue with leveldb.
> 
> I have backed up /var/lib/ceph/mon on all of my monitors before trying to
> proceed with any form of recovery.
> 
> Is there any way to reconstruct the leveldb or replace the monitors and
> recover the data?
> 
I don't know. I have never done it. Other people might know this better than me.

Maybe 'ceph-monstore-tool' can help you?
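
I haven't used it for a recovery myself and the subcommands differ per version,
so check ceph-monstore-tool --help on hammer first, but something along these
lines (paths are just examples):

$ ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname) store-copy /root/mon-store-copy
$ ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname) dump-keys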

Wido

> I found the following post in which Sage says it is tedious but possible. (
> http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if
> I have any chance of doing it.  I have the fsid, the mon key map and all of
> the osds look to be fine, so all of the previous osd maps are there.
> 
> I just don't understand what key/values I need inside.
> 
> On Aug 11, 2016 1:33 AM, "Wido den Hollander"  wrote:
> 
> >
> > > On 11 August 2016 at 0:10, Sean Sullivan <
> > seapasu...@uchicago.edu> wrote:
> > >
> > >
> > > I think it just got worse::
> > >
> > > all three monitors on my other cluster say that ceph-mon can't open
> > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose
> > all
> > > 3 monitors? I saw a post by Sage saying that the data can be recovered as
> > > all of the data is held on other servers. Is this possible? If so has
> > > anyone had any experience doing so?
> >
> > I have never done so, so I couldn't tell you.
> >
> > However, it is weird that on all three it got corrupted. What hardware are
> > you using? Was it properly protected against power failure?
> >
> > If your mon store is corrupted, I'm not sure what might happen.
> >
> > However, make a backup of ALL monitors right now before doing anything.
> >
> > Wido
> >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd kernel requirements

2016-08-11 Thread Jason Dillaman
I don't think anyone has really looked into the cause yet, so it's
hard to say where the problem lies.

On Thu, Aug 11, 2016 at 9:36 AM, Shawn Edwards  wrote:
> Is it thought that this bug is in Ceph and not the kernel?
>
>
> On Thu, Aug 11, 2016 at 8:14 AM Jason Dillaman  wrote:
>>
>> At this point, we only have automated tests that exercise it against
>> stock Ubuntu Trusty but that will eventually expand to Xenial once we
>> get our lab configured for it.  There is one known issue right now
>> where the kernel can deadlock while starting the test [1] and
>> mapping/unmapping the device.
>>
>> Red Hat Enterprise Linux has the nbd driver disabled so we don't test
>> against those variants. I've personally run it on Fedora 23 and 24
>> during manual testing without any apparent issues.
>>
>> [1] http://tracker.ceph.com/issues/16921
>>
>> On Wed, Aug 10, 2016 at 7:16 PM, Shawn Edwards 
>> wrote:
>> > Is there a minimum kernel version required for rbd-nbd to work and work
>> > well?  Before I start stress testing it, I want to be sure I have a
>> > system
>> > that is expected to work.
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>
>> --
>> Jason
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Christian Balzer

Hello,

see below.

On Thu, 11 Aug 2016 12:52:32 +0200 (CEST) Wido den Hollander wrote:

> 
> > On 11 August 2016 at 10:18, Eugen Block  wrote:
> > 
> > 
> > Thanks for the really quick response!
> > 
> > > Warning! These are not your regular log files.
> > 
> > Thanks for the warning!
> > 
> > > You shouldn't have to worry about that. The MONs should compact and  
> > > rotate those logs themselves.
> > 
> > I believe the compaction works fine, but these large LOG files just  
> > grow until mon restart. Is there no way to limit the size to a desired  
> > value or anything similar?
> > 
> 
> That's not good. That shouldn't happen. The monitor has to trim these logs as 
> well.
> 
> How big is your mon store?
> 
> $ du -sh /var/lib/ceph/mon/*
> 
> > > What version of Ceph are you running exactly?
> > 
> > ceph@node1:~/ceph-deploy> ceph --version
> > ceph version 0.94.6-75
> > 
> 
> 0.94.7 is already out, might be worth upgrading. Release Notes don't tell 
> anything about this case though.
> 

0.94.5 definitely has that bug (no compaction on either MON or OSD
leveldbs) and I remember a tracker and release note entry about that.

And 0.94.7 definitely doesn't have that problem.
While 0.94.6 has the potential to eat all your data with cache-tiering.

So yeah, upgrade time.
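
If/when you do, restart the mons one at a time and wait for quorum before
moving on to the next, e.g. (mon id is an example, adjust to your naming):

$ systemctl restart ceph-mon@mon1.service
$ ceph quorum_status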

Christian

> > > What is the output of ceph -s?
> > 
> > ceph@node1:~/ceph-deploy> ceph -s
> >  cluster 655cb05a-435a-41ba-83d9-8549f7c36167
> >   health HEALTH_OK
> >   monmap e7: 3 mons at  
> > {mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}
> >  election epoch 242, quorum 0,1,2 mon1,mon2,mon3
> >   osdmap e2377: 19 osds: 19 up, 19 in
> >pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects
> >  3223 GB used, 4929 GB / 8153 GB avail
> >  4336 active+clean
> >client io 0 B/s rd, 72112 B/s wr, 7 op/s
> > 
> 
> Ok, that's good. Monitors don't trim the logs when the cluster isn't healthy, 
> but yours is.
> 
> Wido
> 
> > 
> > Quoting Wido den Hollander :
> > 
> > >> On 11 August 2016 at 9:56, Eugen Block  wrote:
> > >>
> > >>
> > >> Hi list,
> > >>
> > >> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 
> > >> MONs.
> > >> Now after a couple of weeks we noticed that we're running out of disk
> > >> space on one of the nodes in /var.
> > >> Similar to [1] there are two large LOG files in
> > >> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are
> > >> managed when the respective MON is restarted. But the MONs are not
> > >> restarted regularly so the log files can grow for months and fill up
> > >> the file system.
> > >>
> > >
> > > Warning! These are not your regular log files. They are binary logs  
> > > of LevelDB which are mandatory for the MONs to work!
> > >
> > >> I was thinking about adding another file in /etc/logrotate.d/ and
> > >> trigger a monitor restart once a week. But I'm not sure if it's
> > >> recommended to restart all MONs at the same time, which could happen
> > >> if someone started logrotate manually.
> > >> So my question is, how do you guys manage that and how is it supposed
> > >> to be handled? I'd really appreciate any insights!
> > >>
> > > You shouldn't have to worry about that. The MONs should compact and  
> > > rotate those logs themselves.
> > >
> > > They compact their store on start, so that works for you, but they  
> > > should do this while running.
> > >
> > > What version of Ceph are you running exactly?
> > >
> > > What is the output of ceph -s? MONs usually only compact when the  
> > > cluster is healthy.
> > >
> > > Wido
> > >
> > >> Regards,
> > >> Eugen
> > >>
> > >> [1]
> > >> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor
> > >>
> > >> --
> > >> Eugen Block voice   : +49-40-559 51 75
> > >> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> > >> Postfach 61 03 15
> > >> D-22423 Hamburg e-mail  : ebl...@nde.ag
> > >>
> > >>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> > >>Sitz und Registergericht: Hamburg, HRB 90934
> > >>Vorstand: Jens-U. Mozdzen
> > >> USt-IdNr. DE 814 013 983
> > >>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > 
> > 
> > -- 
> > Eugen Block voice   : +49-40-559 51 75
> > NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> > Postfach 61 03 15
> > D-22423 Hamburg e-mail  : ebl...@nde.ag
> > 
> >  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >Sitz und Registergericht: Hamburg, HRB 90934
> >Vorstand: Jens-U. Mozdzen
> > USt-IdNr. DE 814 

Re: [ceph-users] Backfilling pgs not making progress

2016-08-11 Thread Brian Felton
Sam,

I very much appreciate the assistance.  I have opened
http://tracker.ceph.com/issues/16997 to track this (potential) issue.
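
For 3) and 4) from your list below I'm planning to do roughly the following,
unless there's a better way (osd ids are placeholders):

$ ceph tell osd.<id> injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
$ ceph osd down <primary-osd-id>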

Brian

On Wed, Aug 10, 2016 at 1:53 PM, Samuel Just  wrote:

> Ok, can you
> 1) Open a bug
> 2) Identify all osds involved in the 5 problem pgs
> 3) enable debug osd = 20, debug filestore = 20, debug ms = 1 on all of them
> 4) mark the primary for each pg down (should cause peering and
> backfill to restart)
> 5) link all logs to the bug
>
> Thanks!
> -Sam
>
> On Tue, Jul 26, 2016 at 9:11 AM, Samuel Just  wrote:
> > Hmm, nvm, it's not an lfn object anyway.
> > -Sam
> >
> > On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton 
> wrote:
> >> If I search on osd.580, I find
> >> default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\
> s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon
> >> picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct.
> >> 1\sDSC04329.JPG__head_981926C1__21__5, which has a
> non-zero
> >> size and a hash (981926C1) that matches that of the same file found on
> the
> >> other OSDs in the pg.
> >>
> >> If I'm misunderstanding what you're asking about a dangling link, please
> >> point me in the right direction.
> >>
> >> Brian
> >>
> >> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just  wrote:
> >>>
> >>> Did you also confirm that the backfill target does not have any of
> >>> those dangling links?  I'd be looking for a dangling link for
> >>>
> >>> 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-
> 6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/
> London/Ron
> >>> picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33
> >>> on osd.580.
> >>> -Sam
> >>>
> >>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton 
> wrote:
> >>> > Sam,
> >>> >
> >>> > I cranked up the logging on the backfill target (osd 580 on node 07)
> and
> >>> > the
> >>> > acting primary for the pg (453 on node 08, for what it's worth).  The
> >>> > logs
> >>> > from the primary are very large, so pardon the tarballs.
> >>> >
> >>> > PG Primary Logs:
> >>> > https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-
> primary-log.tgz?dl=0B
> >>> > PG Backfill Target Logs:
> >>> > https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-
> target-log.tgz?dl=0
> >>> >
> >>> > I'll be reviewing them with my team tomorrow morning to see if we can
> >>> > find
> >>> > anything.  Thanks for your assistance.
> >>> >
> >>> > Brian
> >>> >
> >>> > On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just 
> wrote:
> >>> >>
> >>> >> The next thing I'd want is for you to reproduce with
> >>> >>
> >>> >> debug osd = 20
> >>> >> debug filestore = 20
> >>> >> debug ms = 1
> >>> >>
> >>> >> and post the file somewhere.
> >>> >> -Sam
> >>> >>
> >>> >> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just 
> wrote:
> >>> >> > If you don't have the orphaned file link, it's not the same bug.
> >>> >> > -Sam
> >>> >> >
> >>> >> > On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton <
> bjfel...@gmail.com>
> >>> >> > wrote:
> >>> >> >> Sam,
> >>> >> >>
> >>> >> >> I'm reviewing that thread now, but I'm not seeing a lot of
> overlap
> >>> >> >> with
> >>> >> >> my
> >>> >> >> cluster's situation.  For one, I am unable to start either a
> repair
> >>> >> >> or
> >>> >> >> a
> >>> >> >> deep scrub on any of the affected pgs.  I've instructed all six
> of
> >>> >> >> the
> >>> >> >> pgs
> >>> >> >> to scrub, deep-scrub, and repair, and the cluster has been
> gleefully
> >>> >> >> ignoring these requests (it has been several hours since I first
> >>> >> >> tried,
> >>> >> >> and
> >>> >> >> the logs indicate none of the pgs ever scrubbed).  Second, none
> of
> >>> >> >> my
> >>> >> >> OSDs is crashing.  Third, none of my pgs or objects has ever been
> >>> >> >> marked
> >>> >> >> inconsistent (or unfound, for that matter) -- I'm only seeing the
> >>> >> >> standard
> >>> >> >> mix of degraded/misplaced objects that are common during a
> recovery.
> >>> >> >> What
> >>> >> >> I'm not seeing is any further progress on the number of misplaced
> >>> >> >> objects --
> >>> >> >> the number has remained effectively unchanged for the past
> several
> >>> >> >> days.
> >>> >> >>
> >>> >> >> To be sure, though, I tracked down the file that the backfill
> >>> >> >> operation
> >>> >> >> seems to be hung on, and I can find it in both the backfill
> target
> >>> >> >> osd
> >>> >> >> (580)
> >>> >> >> and a few other osds in the pg.  In all cases, I was able to find
> >>> >> >> the
> >>> >> >> file
> >>> >> >> with an identical hash value on all nodes, and I didn't find any
> >>> >> >> duplicates
> >>> >> >> or potential orphans.  Also, none of the objects involved have
> long
> >>> >> >> names,
> >>> >> >> so they're not using the special ceph long filename handling.
> >>> >> >>
> >>> >> >> Also, we are not using XFS on our OSDs; we are using ZFS 

Re: [ceph-users] rbd-nbd kernel requirements

2016-08-11 Thread Shawn Edwards
Is it thought that this bug is in Ceph and not the kernel?

On Thu, Aug 11, 2016 at 8:14 AM Jason Dillaman  wrote:

> At this point, we only have automated tests that exercise it against
> stock Ubuntu Trusty but that will eventually expand to Xenial once we
> get our lab configured for it.  There is one known issue right now
> where the kernel can deadlock while starting the test [1] and
> mapping/unmapping the device.
>
> Red Hat Enterprise Linux has the nbd driver disabled so we don't test
> against those variants. I've personally run it on Fedora 23 and 24
> during manual testing without any apparent issues.
>
> [1] http://tracker.ceph.com/issues/16921
>
> On Wed, Aug 10, 2016 at 7:16 PM, Shawn Edwards 
> wrote:
> > Is there a minimum kernel version required for rbd-nbd to work and work
> > well?  Before I start stress testing it, I want to be sure I have a
> system
> > that is expected to work.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Eugen Block

How big is your mon store?


ceph@node1:~/ceph-deploy> du -hls /var/lib/ceph/mon/*
31M /var/lib/ceph/mon/ceph-mon1

Please note that I restarted the monitor on node1 yesterday; for  
reference, the output for node2:


ceph@node2:~> du -h /var/lib/ceph/mon/*
156M/var/lib/ceph/mon/ceph-mon2

ceph@node2:~> systemctl status ceph-mon@mon2.service
ceph-mon@mon2.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled)
   Active: active (running) since Do 2016-07-07 17:15:49 CEST; 1  
months 4 days ago


That's not good. That shouldn't happen. The monitor has to trim  
these logs as well.


What could be the problem? Maybe a missing option in the ceph.conf?
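
The only option I've found so far that looks related is the one below, though
if I understand it correctly it only compacts at startup, which is what we
already see, so I'd still prefer something that works on a running mon:

[mon]
mon compact on start = true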


Quoting Wido den Hollander :


On 11 August 2016 at 10:18, Eugen Block  wrote:


Thanks for the really quick response!

> Warning! These are not your regular log files.

Thanks for the warning!

> You shouldn't have to worry about that. The MONs should compact and
> rotate those logs themselves.

I believe the compaction works fine, but these large LOG files just
grow until mon restart. Is there no way to limit the size to a desired
value or anything similar?



That's not good. That shouldn't happen. The monitor has to trim  
these logs as well.


How big is your mon store?

$ du -sh /var/lib/ceph/mon/*


> What version of Ceph are you running exactly?

ceph@node1:~/ceph-deploy> ceph --version
ceph version 0.94.6-75



0.94.7 is already out, might be worth upgrading. Release Notes don't  
tell anything about this case though.



> What is the output of ceph -s?

ceph@node1:~/ceph-deploy> ceph -s
 cluster 655cb05a-435a-41ba-83d9-8549f7c36167
  health HEALTH_OK
  monmap e7: 3 mons at
{mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}
 election epoch 242, quorum 0,1,2 mon1,mon2,mon3
  osdmap e2377: 19 osds: 19 up, 19 in
   pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects
 3223 GB used, 4929 GB / 8153 GB avail
 4336 active+clean
   client io 0 B/s rd, 72112 B/s wr, 7 op/s



Ok, that's good. Monitors don't trim the logs when the cluster isn't  
healthy, but yours is.


Wido



Quoting Wido den Hollander :

>> On 11 August 2016 at 9:56, Eugen Block  wrote:
>>
>>
>> Hi list,
>>
>> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs.
>> Now after a couple of weeks we noticed that we're running out of disk
>> space on one of the nodes in /var.
>> Similar to [1] there are two large LOG files in
>> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are
>> managed when the respective MON is restarted. But the MONs are not
>> restarted regularly so the log files can grow for months and fill up
>> the file system.
>>
>
> Warning! These are not your regular log files. They are binary logs
> of LevelDB which are mandatory for the MONs to work!
>
>> I was thinking about adding another file in /etc/logrotate.d/ and
>> trigger a monitor restart once a week. But I'm not sure if it's
>> recommended to restart all MONs at the same time, which could happen
>> if someone started logrotate manually.
>> So my question is, how do you guys manage that and how is it supposed
>> to be handled? I'd really appreciate any insights!
>>
> You shouldn't have to worry about that. The MONs should compact and
> rotate those logs themselves.
>
> They compact their store on start, so that works for you, but they
> should do this while running.
>
> What version of Ceph are you running exactly?
>
> What is the output of ceph -s? MONs usually only compact when the
> cluster is healthy.
>
> Wido
>
>> Regards,
>> Eugen
>>
>> [1]
>> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor
>>
>> --
>> Eugen Block voice   : +49-40-559 51 75
>> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
>> Postfach 61 03 15
>> D-22423 Hamburg e-mail  : ebl...@nde.ag
>>
>>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>>Sitz und Registergericht: Hamburg, HRB 90934
>>Vorstand: Jens-U. Mozdzen
>> USt-IdNr. DE 814 013 983
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

 Vorsitzende des Aufsichtsrates: Angelika Mozdzen
   Sitz und Registergericht: Hamburg, HRB 90934
   Vorstand: Jens-U. Mozdzen
USt-IdNr. DE 814 013 983





--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und 

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Sean Sullivan
Hello Wido,

Thanks for the advice.  While the data center has a/b circuits and
redundant power, etc., if a ground fault happens it travels outside and
fails, causing the whole building to fail (apparently).

The monitors are each the same with
2x e5 cpus
64gb of ram
4x 300gb 10k SAS drives in raid 10 (write through mode).
Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 -
3am CST)
Ceph hammer LTS 0.94.7

(we are still working on our jewel test cluster so it is planned but not in
place yet)

The only thing that seems to be corrupt is the monitors leveldb store.  I
see multiple issues on Google leveldb github from March 2016 about fsync
and power failure so I assume this is an issue with leveldb.

I have backed up /var/lib/ceph/mon on all of my monitors before trying to
proceed with any form of recovery.

Is there any way to reconstruct the leveldb or replace the monitors and
recover the data?

I found the following post in which Sage says it is tedious but possible. (
http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if
I have any chance of doing it.  I have the fsid, the mon key map and all of
the osds look to be fine, so all of the previous osd maps are there.

I just don't understand what key/values I need inside.

On Aug 11, 2016 1:33 AM, "Wido den Hollander"  wrote:

>
> > On 11 August 2016 at 0:10, Sean Sullivan <
> seapasu...@uchicago.edu> wrote:
> >
> >
> > I think it just got worse::
> >
> > all three monitors on my other cluster say that ceph-mon can't open
> > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose
> all
> > 3 monitors? I saw a post by Sage saying that the data can be recovered as
> > all of the data is held on other servers. Is this possible? If so has
> > anyone had any experience doing so?
>
> I have never done so, so I couldn't tell you.
>
> However, it is weird that on all three it got corrupted. What hardware are
> you using? Was it properly protected against power failure?
>
> If your mon store is corrupted, I'm not sure what might happen.
>
> However, make a backup of ALL monitors right now before doing anything.
>
> Wido
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd kernel requirements

2016-08-11 Thread Jason Dillaman
At this point, we only have automated tests that exercise it against
stock Ubuntu Trusty but that will eventually expand to Xenial once we
get our lab configured for it.  There is one known issue right now
where the kernel can deadlock while starting the test [1] and
mapping/unmapping the device.

Red Hat Enterprise Linux has the nbd driver disabled so we don't test
against those variants. I've personally run it on Fedora 23 and 24
during manual testing without any apparent issues.

[1] http://tracker.ceph.com/issues/16921
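
If you want to exercise the same map/unmap path the test does, it's basically
the following (pool/image name is just an example):

$ modprobe nbd
$ rbd-nbd map rbd/testimg    # prints the /dev/nbdX it attached to
$ rbd-nbd unmap /dev/nbd0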

On Wed, Aug 10, 2016 at 7:16 PM, Shawn Edwards  wrote:
> Is there a minimum kernel version required for rbd-nbd to work and work
> well?  Before I start stress testing it, I want to be sure I have a system
> that is expected to work.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs performance benchmark -- metadata intensive

2016-08-11 Thread Brett Niver
Patrick and I had a related question yesterday: are we able to dynamically
vary the cache size to artificially manipulate cache pressure?
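
(For what it's worth, mds_cache_size looks injectable at runtime via the admin
socket, something like the commands below -- though I haven't checked how
gracefully the MDS reacts to shrinking it under load, so treat that as an
assumption to verify.)

$ ceph daemon mds.<name> config set mds_cache_size 50000
$ ceph daemon mds.<name> config show | grep mds_cache_size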

On Thu, Aug 11, 2016 at 6:07 AM, John Spray  wrote:

> On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen 
> wrote:
> > Hi ,
> >
> >
> >  Here is the slide I shared yesterday on performance meeting.
> > Thanks and hoping for inputs.
> >
> >
> > http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-
> performance-benchmark
>
> These are definitely useful results and I encourage everyone working
> with cephfs to go and look at Xiaoxi's slides.
>
> The main thing that this highlighted for me was our lack of testing so
> far on systems with full caches.  Too much of our existing testing is
> done on freshly configured systems that never fill the MDS cache.
>
> Test 2.1 notes that we don't enable directory fragmentation by default
> currently -- this is an issue, and I'm hoping we can switch it on by
> default in Kraken (see thread "Switching on mds_bal_frag by default").
> In the meantime we have the fix that Patrick wrote for Jewel which at
> least prevents people creating dirfrags too large for the OSDs to
> handle.
>
> Test 2.2: since a "failing to respond to cache pressure" bug is
> affecting this, I would guess we see the performance fall off at about
> the point where the *client* caches fill up (so they start trimming
> things even though they're ignoring cache pressure).  It would be
> interesting to see this chart with additional lines for some related
> perf counters like mds_log.evtrm and mds.inodes_expired, that might
> make it pretty obvious where the MDS is entering different stages that
> see a decrease in the rate of handling client requests.
>
> We really need to sort out the "failing to respond to cache pressure"
> issues that keep popping up, especially if they're still happening on
> a comparatively simple test that is just creating files.  We have a
> specific test for this[1] that is currently being run against the fuse
> client but not the kernel client[2].  This is a good time to try and
> push that forward so I've kicked off an experimental run here:
> http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-
> kcephfs:recovery-master-testing-basic-mira/
>
> In the meantime, although there are reports of similar issues with
> newer kernels, it would be very useful to confirm if the same issue is
> still occurring with more recent kernels.  Issues with cache trimming
> have occurred due to various (separate) bugs, so it's possible that
> while some people are still seeing cache trimming issues with recent
> kernels, the specific case you're hitting might be fixed.
>
> Test 2.3: restarting the MDS doesn't actually give you a completely
> empty cache (everything in the journal gets replayed to pre-populate
> the cache on MDS startup).  However, the results are still valid
> because you're using a different random order in the non-caching test
> case, and the number of inodes in your journal is probably much
> smaller than the overall cache size so it's only a little bit
> populated.  We don't currently have a "drop cache" command built into
> the MDS but it would be pretty easy to add one for use in testing
> (basically just call mds->mdcache->trim(0)).
>
> As one would imagine, the non-caching case is latency-dominated when
> the working set is larger than the cache, where each client is waiting
> for one open to finish before proceeding to the next.  The MDS is
> probably capable of handling many more operations per second, but it
> would need more parallel IO operations from the clients.  When a
> single client is doing opens one by one, you're potentially seeing a
> full network+disk latency for each one (though in practice the OSD
> read cache will be helping a lot here).  This non-caching case would
> be the main argument for giving the metadata pool low latency (SSD)
> storage.
>
> Test 2.5: The observation that the CPU bottleneck makes using fast
> storage for the metadata pool less useful (in sequential/cached cases)
> is valid, although it could still be useful to isolate the metadata
> OSDs (probably SSDs since not so much capacity is needed) to avoid
> competing with data operations.  For random access in the non-caching
> cases (2.3, 2.4) I think you would probably see an improvement from
> SSDs.
>
> Thanks again to the team from ebay for sharing all this.
>
> John
>
>
>
> 1. https://github.com/ceph/ceph-qa-suite/blob/master/tasks/
> cephfs/test_client_limits.py#L96
> 2. http://tracker.ceph.com/issues/9466
>
>
> >
> > Xiaoxi
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Re: [ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Wido den Hollander

> On 11 August 2016 at 10:18, Eugen Block  wrote:
> 
> 
> Thanks for the really quick response!
> 
> > Warning! These are not your regular log files.
> 
> Thanks for the warning!
> 
> > You shouldn't have to worry about that. The MONs should compact and  
> > rotate those logs themselves.
> 
> I believe the compaction works fine, but these large LOG files just  
> grow until mon restart. Is there no way to limit the size to a desired  
> value or anything similar?
> 

That's not good. That shouldn't happen. The monitor has to trim these logs as 
well.

How big is your mon store?

$ du -sh /var/lib/ceph/mon/*

> > What version of Ceph are you running exactly?
> 
> ceph@node1:~/ceph-deploy> ceph --version
> ceph version 0.94.6-75
> 

0.94.7 is already out, might be worth upgrading. Release Notes don't tell 
anything about this case though.

> > What is the output of ceph -s?
> 
> ceph@node1:~/ceph-deploy> ceph -s
>  cluster 655cb05a-435a-41ba-83d9-8549f7c36167
>   health HEALTH_OK
>   monmap e7: 3 mons at  
> {mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}
>  election epoch 242, quorum 0,1,2 mon1,mon2,mon3
>   osdmap e2377: 19 osds: 19 up, 19 in
>pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects
>  3223 GB used, 4929 GB / 8153 GB avail
>  4336 active+clean
>client io 0 B/s rd, 72112 B/s wr, 7 op/s
> 

Ok, that's good. Monitors don't trim the logs when the cluster isn't healthy, 
but yours is.

Wido

> 
> Quoting Wido den Hollander :
> 
> >> On 11 August 2016 at 9:56, Eugen Block  wrote:
> >>
> >>
> >> Hi list,
> >>
> >> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs.
> >> Now after a couple of weeks we noticed that we're running out of disk
> >> space on one of the nodes in /var.
> >> Similar to [1] there are two large LOG files in
> >> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are
> >> managed when the respective MON is restarted. But the MONs are not
> >> restarted regularly so the log files can grow for months and fill up
> >> the file system.
> >>
> >
> > Warning! These are not your regular log files. They are binary logs  
> > of LevelDB which are mandatory for the MONs to work!
> >
> >> I was thinking about adding another file in /etc/logrotate.d/ and
> >> trigger a monitor restart once a week. But I'm not sure if it's
> >> recommended to restart all MONs at the same time, which could happen
> >> if someone started logrotate manually.
> >> So my question is, how do you guys manage that and how is it supposed
> >> to be handled? I'd really appreciate any insights!
> >>
> > You shouldn't have to worry about that. The MONs should compact and  
> > rotate those logs themselves.
> >
> > They compact their store on start, so that works for you, but they  
> > should do this while running.
> >
> > What version of Ceph are you running exactly?
> >
> > What is the output of ceph -s? MONs usually only compact when the  
> > cluster is healthy.
> >
> > Wido
> >
> >> Regards,
> >> Eugen
> >>
> >> [1]
> >> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor
> >>
> >> --
> >> Eugen Block voice   : +49-40-559 51 75
> >> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> >> Postfach 61 03 15
> >> D-22423 Hamburg e-mail  : ebl...@nde.ag
> >>
> >>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >>Sitz und Registergericht: Hamburg, HRB 90934
> >>Vorstand: Jens-U. Mozdzen
> >> USt-IdNr. DE 814 013 983
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Eugen Block voice   : +49-40-559 51 75
> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> Postfach 61 03 15
> D-22423 Hamburg e-mail  : ebl...@nde.ag
> 
>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>Sitz und Registergericht: Hamburg, HRB 90934
>Vorstand: Jens-U. Mozdzen
> USt-IdNr. DE 814 013 983
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs performance benchmark -- metadata intensive

2016-08-11 Thread John Spray
On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen  wrote:
> Hi ,
>
>
>  Here is the slide I shared yesterday on performance meeting.
> Thanks and hoping for inputs.
>
>
> http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark

These are definitely useful results and I encourage everyone working
with cephfs to go and look at Xiaoxi's slides.

The main thing that this highlighted for me was our lack of testing so
far on systems with full caches.  Too much of our existing testing is
done on freshly configured systems that never fill the MDS cache.

Test 2.1 notes that we don't enable directory fragmentation by default
currently -- this is an issue, and I'm hoping we can switch it on by
default in Kraken (see thread "Switching on mds_bal_frag by default").
In the meantime we have the fix that Patrick wrote for Jewel which at
least prevents people creating dirfrags too large for the OSDs to
handle.
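
(For anyone who wants to experiment with it in the meantime, it's the setting
below on the MDSs -- with the usual caveat that it's off by default for a
reason, so test it outside production first.)

[mds]
mds bal frag = true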

Test 2.2: since a "failing to respond to cache pressure" bug is
affecting this, I would guess we see the performance fall off at about
the point where the *client* caches fill up (so they start trimming
things even though they're ignoring cache pressure).  It would be
interesting to see this chart with additional lines for some related
perf counters like mds_log.evtrm and mds.inodes_expired, that might
make it pretty obvious where the MDS is entering different stages that
see a decrease in the rate of handling client requests.
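
(Those counters can be read off the MDS admin socket while the test runs; the
mds name below is a placeholder.)

$ ceph daemon mds.<name> perf dump | grep -E 'evtrm|inodes_expired'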

We really need to sort out the "failing to respond to cache pressure"
issues that keep popping up, especially if they're still happening on
a comparatively simple test that is just creating files.  We have a
specific test for this[1] that is currently being run against the fuse
client but not the kernel client[2].  This is a good time to try and
push that forward so I've kicked off an experimental run here:
http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/

In the meantime, although there are reports of similar issues with
newer kernels, it would be very useful to confirm if the same issue is
still occurring with more recent kernels.  Issues with cache trimming
have occurred due to various (separate) bugs, so it's possible that
while some people are still seeing cache trimming issues with recent
kernels, the specific case you're hitting might be fixed.

Test 2.3: restarting the MDS doesn't actually give you a completely
empty cache (everything in the journal gets replayed to pre-populate
the cache on MDS startup).  However, the results are still valid
because you're using a different random order in the non-caching test
case, and the number of inodes in your journal is probably much
smaller than the overall cache size so it's only a little bit
populated.  We don't currently have a "drop cache" command built into
the MDS but it would be pretty easy to add one for use in testing
(basically just call mds->mdcache->trim(0)).

As one would imagine, the non-caching case is latency-dominated when
the working set is larger than the cache, where each client is waiting
for one open to finish before proceeding to the next.  The MDS is
probably capable of handling many more operations per second, but it
would need more parallel IO operations from the clients.  When a
single client is doing opens one by one, you're potentially seeing a
full network+disk latency for each one (though in practice the OSD
read cache will be helping a lot here).  This non-caching case would
be the main argument for giving the metadata pool low latency (SSD)
storage.

Test 2.5: The observation that the CPU bottleneck makes using fast
storage for the metadata pool less useful (in sequential/cached cases)
is valid, although it could still be useful to isolate the metadata
OSDs (probably SSDs since not so much capacity is needed) to avoid
competing with data operations.  For random access in the non-caching
cases (2.3, 2.4) I think you would probably see an improvement from
SSDs.

Thanks again to the team from ebay for sharing all this.

John



1. 
https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_client_limits.py#L96
2. http://tracker.ceph.com/issues/9466


>
> Xiaoxi
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Jan Schermer
I had to make a cronjob to trigger compact on the MONs as well.
Ancient version, though.
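
Something like this in /etc/cron.d, if it helps -- mon id and schedule are
examples, and compaction can briefly stall the mon, so don't fire it on all
mons at the same time:

# compact the local mon's leveldb store once a week
0 3 * * 0  root  ceph tell mon.a compact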

Jan

> On 11 Aug 2016, at 10:09, Wido den Hollander  wrote:
> 
> 
>> On 11 August 2016 at 9:56, Eugen Block  wrote:
>> 
>> 
>> Hi list,
>> 
>> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs.
>> Now after a couple of weeks we noticed that we're running out of disk  
>> space on one of the nodes in /var.
>> Similar to [1] there are two large LOG files in  
>> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are  
>> managed when the respective MON is restarted. But the MONs are not  
>> restarted regularly so the log files can grow for months and fill up  
>> the file system.
>> 
> 
> Warning! These are not your regular log files. They are binary logs of 
> LevelDB which are mandatory for the MONs to work!
> 
>> I was thinking about adding another file in /etc/logrotate.d/ and  
>> trigger a monitor restart once a week. But I'm not sure if it's  
>> recommended to restart all MONs at the same time, which could happen  
>> if someone started logrotate manually.
>> So my question is, how do you guys manage that and how is it supposed  
>> to be handled? I'd really appreciate any insights!
>> 
> You shouldn't have to worry about that. The MONs should compact and rotate 
> those logs themselves.
> 
> They compact their store on start, so that works for you, but they should do 
> this while running.
> 
> What version of Ceph are you running exactly?
> 
> What is the output of ceph -s? MONs usually only compact when the cluster is 
> healthy.
> 
> Wido
> 
>> Regards,
>> Eugen
>> 
>> [1]  
>> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor
>> 
>> -- 
>> Eugen Block voice   : +49-40-559 51 75
>> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
>> Postfach 61 03 15
>> D-22423 Hamburg e-mail  : ebl...@nde.ag
>> 
>> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>>   Sitz und Registergericht: Hamburg, HRB 90934
>>   Vorstand: Jens-U. Mozdzen
>>USt-IdNr. DE 814 013 983
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Eugen Block

Thanks for the really quick response!


Warning! These are not your regular log files.


Thanks for the warning!

You shouldn't have to worry about that. The MONs should compact and  
rotate those logs themselves.


I believe the compaction works fine, but these large LOG files just  
grow until mon restart. Is there no way to limit the size to a desired  
value or anything similar?



What version of Ceph are you running exactly?


ceph@node1:~/ceph-deploy> ceph --version
ceph version 0.94.6-75


What is the output of ceph -s?


ceph@node1:~/ceph-deploy> ceph -s
cluster 655cb05a-435a-41ba-83d9-8549f7c36167
 health HEALTH_OK
 monmap e7: 3 mons at  
{mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}

election epoch 242, quorum 0,1,2 mon1,mon2,mon3
 osdmap e2377: 19 osds: 19 up, 19 in
  pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects
3223 GB used, 4929 GB / 8153 GB avail
4336 active+clean
  client io 0 B/s rd, 72112 B/s wr, 7 op/s


Quoting Wido den Hollander :


On 11 August 2016 at 9:56, Eugen Block  wrote:


Hi list,

we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs.
Now after a couple of weeks we noticed that we're running out of disk
space on one of the nodes in /var.
Similar to [1] there are two large LOG files in
/var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are
managed when the respective MON is restarted. But the MONs are not
restarted regularly so the log files can grow for months and fill up
the file system.



Warning! These are not your regular log files. They are binary logs  
of LevelDB which are mandatory for the MONs to work!



I was thinking about adding another file in /etc/logrotate.d/ and
trigger a monitor restart once a week. But I'm not sure if it's
recommended to restart all MONs at the same time, which could happen
if someone started logrotate manually.
So my question is, how do you guys manage that and how is it supposed
to be handled? I'd really appreciate any insights!

You shouldn't have to worry about that. The MONs should compact and  
rotate those logs themselves.


They compact their store on start, so that works for you, but they  
should do this while running.


What version of Ceph are you running exactly?

What is the output of ceph -s? MONs usually only compact when the  
cluster is healthy.


Wido


Regards,
Eugen

[1]
http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor

--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

 Vorsitzende des Aufsichtsrates: Angelika Mozdzen
   Sitz und Registergericht: Hamburg, HRB 90934
   Vorstand: Jens-U. Mozdzen
USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Wido den Hollander

> On 11 August 2016 at 9:56, Eugen Block wrote:
> 
> 
> Hi list,
> 
> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs.
> Now after a couple of weeks we noticed that we're running out of disk  
> space on one of the nodes in /var.
> Similar to [1] there are two large LOG files in  
> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are  
> managed when the respective MON is restarted. But the MONs are not  
> restarted regularly so the log files can grow for months and fill up  
> the file system.
> 

Warning! These are not your regular log files. They are binary logs of LevelDB 
which are mandatory for the MONs to work!
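
If you only want to keep an eye on how much space the store takes, a
read-only check is harmless (path taken from your description, default
layout assumed):

du -sh /var/lib/ceph/mon/*/store.db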

> I was thinking about adding another file in /etc/logrotate.d/ and  
> trigger a monitor restart once a week. But I'm not sure if it's  
> recommended to restart all MONs at the same time, which could happen  
> if someone started logrotate manually.
> So my question is, how do you guys manage that and how is it supposed  
> to be handled? I'd really appreciate any insights!
> 
You shouldn't have to worry about that. The MONs should compact and rotate 
those logs themselves.

They compact their store on start, so that works for you, but they should do 
this while running.

What version of Ceph are you running exactly?

What is the output of ceph -s? MONs usually only compact when the cluster is 
healthy.

Wido

> Regards,
> Eugen
> 
> [1]  
> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor
> 
> -- 
> Eugen Block voice   : +49-40-559 51 75
> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> Postfach 61 03 15
> D-22423 Hamburg e-mail  : ebl...@nde.ag
> 
>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>Sitz und Registergericht: Hamburg, HRB 90934
>Vorstand: Jens-U. Mozdzen
> USt-IdNr. DE 814 013 983
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Eugen Block

Hi list,

we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs.
Now after a couple of weeks we noticed that we're running out of disk  
space on one of the nodes in /var.
Similar to [1] there are two large LOG files in  
/var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are  
managed when the respective MON is restarted. But the MONs are not  
restarted regularly so the log files can grow for months and fill up  
the file system.


I was thinking about adding another file in /etc/logrotate.d/ and  
trigger a monitor restart once a week. But I'm not sure if it's  
recommended to restart all MONs at the same time, which could happen  
if someone started logrotate manually.
So my question is, how do you guys manage that and how is it supposed  
to be handled? I'd really appreciate any insights!
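
For reference, this is roughly what I had in mind (untested sketch; the
mon id and the sysvinit restart call are guesses for our Hammer setup,
and I realize it only forces a restart rather than limiting the size
directly):

# hypothetical /etc/logrotate.d/ceph-mon-restart
/var/log/ceph/ceph-mon.*.log {
    weekly
    rotate 4
    compress
    missingok
    postrotate
        # restart only the local monitor
        /etc/init.d/ceph restart mon.$(hostname -s) || true
    endscript
}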


Regards,
Eugen

[1]  
http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor


--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Tomasz Kuzemko
I'm guessing you had the writeback cache enabled on the ceph-mon disk
(check with smartctl -g wcache /dev/sdX) and the disk firmware did not
respect flush semantics.
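
If that turns out to be the case, disabling the volatile write cache is
worth considering. A rough sketch (the device name is an example; the
right tool depends on the drive and controller):

# query the current write-cache setting
smartctl -g wcache /dev/sdX

# disable the volatile write cache
smartctl -s wcache,off /dev/sdX
# or, for plain SATA drives:
hdparm -W0 /dev/sdX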

On 11.08.2016 08:33, Wido den Hollander wrote:
> 
>> On 11 August 2016 at 0:10, Sean Sullivan wrote:
>>
>>
>> I think it just got worse::
>>
>> all three monitors on my other cluster say that ceph-mon can't open
>> /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all
>> 3 monitors? I saw a post by Sage saying that the data can be recovered as
>> all of the data is held on other servers. Is this possible? If so has
>> anyone had any experience doing so?
> 
> I have never done so, so I couldn't tell you.
> 
> However, it is weird that on all three it got corrupted. What hardware are 
> you using? Was it properly protected against power failure?
> 
> If your mon store is corrupted, I'm not sure what might happen.
> 
> However, make a backup of ALL monitors right now before doing anything.
> 
> Wido
> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Wido den Hollander

> On 11 August 2016 at 0:10, Sean Sullivan wrote:
> 
> 
> I think it just got worse::
> 
> all three monitors on my other cluster say that ceph-mon can't open
> /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all
> 3 monitors? I saw a post by Sage saying that the data can be recovered as
> all of the data is held on other servers. Is this possible? If so has
> anyone had any experience doing so?

I have never done so, so I couldn't tell you.

However, it is weird that on all three it got corrupted. What hardware are you 
using? Was it properly protected against power failure?

If your mon store is corrupted, I'm not sure what might happen.

However, make a backup of ALL monitors right now before doing anything.
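
Something simple per monitor node is enough, for example (just a sketch;
it assumes the default store location and that ceph-mon is stopped on
that node while you copy):

tar czf /root/mon-backup-$(hostname)-$(date +%F).tar.gz /var/lib/ceph/mon/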

Wido

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com