Re: [ceph-users] ceph warning

2016-09-01 Thread Ishmael Tsoaela
Thanks,

I'll keep that in mind. I appreciate the assistance.


everything looks good this morning.

cluster df3f96d8-3889-4baa-8b27-cc2839141425
 health HEALTH_OK
 monmap e7: 3 mons at {Monitors}
election epoch 118, quorum 0,1,2 nodeB,nodeC,nodeD
 osdmap e5246: 18 osds: 18 up, 18 in
flags sortbitwise
  pgmap v2503286: 640 pgs, 3 pools, 3338 GB data, 864 kobjects
10101 GB used, 6648 GB / 16749 GB avail
 639 active+clean
   1 active+clean+scrubbing+deep
  client io 77567 B/s wr, 0 op/s rd, 23 op/s wr


ID WEIGHT  REWEIGHT SIZE   USE   AVAIL %USE  VAR  PGS
 3 0.90868  1.0   930G   604G  326G 64.93 1.08  93
 5 0.90868  1.0   930G   660G  269G 71.01 1.18 117
 6 0.90868  1.0   930G   572G  358G 61.48 1.02 118
 0 0.90868  1.0   930G   641G  288G 68.96 1.14 108
 2 0.90868  1.0   930G   470G  460G 50.51 0.84 115
 8 0.90869  1.0   930G   420G  509G 45.21 0.75  89
 1 0.90868  1.0   930G   565G  364G 60.82 1.01 107
 4 0.90868  1.0   930G   586G  344G 63.01 1.04 119
 7 0.90868  1.0   930G   513G  416G 55.23 0.92  98
10 0.90868  1.0   930G   425G  504G 45.74 0.76  89
13 0.90868  1.0   930G   589G  341G 63.30 1.05 112
 9 0.90869  1.0   930G   681G  249G 73.19 1.21 115
15 0.90869  1.0   930G   556G  373G 59.82 0.99 100
16 0.90869  1.0   930G   477G  452G 51.35 0.85 103
17 0.90868  1.0   930G   666G  263G 71.64 1.19 108
18 0.90869  1.0   930G   571G  358G 61.43 1.02 109
19 0.90868  1.0   930G   566G  363G 60.89 1.01 116
11 0.90869  1.0   930G   530G  400G 57.00 0.95 104
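
For reference, the status and per-OSD utilisation above are what you get from
the standard commands, something along the lines of:

  ceph -s        # cluster health, monmap/osdmap/pgmap summary and client io
  ceph osd df    # per-OSD size, use, avail, %use, variance and PG count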


On Fri, Sep 2, 2016 at 2:59 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Thu, 1 Sep 2016 16:24:28 +0200 Ishmael Tsoaela wrote:
>
>> I did configure the following during my initial setup:
>>
>> osd pool default size = 3
>>
> Ah yes, so not this.
> (though the default "rbd" pool that's initially created tended to ignore
> that parameter and would default to 3 in any case)
>
> In fact I remember now writing about this before, you're looking at CRUSH
> in action and the corner cases of a small cluster.
>
> What happened here is that when the OSDs of your 3rd node were gone (down
> and out) CRUSH recalculated the locations of PGs based on the new reality
> and started to move things around.
> And unlike with a larger cluster (4+ nodes) or a single OSD failure, it
> did NOT remove the "old" data after the move, since your replication level
> wasn't achievable.
>
> So this is what filled up your OSDs, moving (copying really) PGs to their
> newly calculated location while not deleting the data at the old location
> afterwards.
>
> As said before, you will want to set
> "mon_osd_down_out_subtree_limit = host"
> at least until your cluster is n+1 sized (4 nodes or more).
>
> Adding more OSDs (and keeping usage below 60% or so) would also achieve
> this, but a 4th node would be more helpful performance wise.
>
> Christian
>
>>
>>
>> root@nodeC:/mnt/vmimages# ceph osd dump | grep "replicated size"
>> pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash
>> rjenkins pg_num 64 pgp_num 64 last_change 217 flags hashpspool
>> stripe_width 0
>> pool 4 'vmimages' replicated size 3 min_size 2 crush_ruleset 0
>> object_hash rjenkins pg_num 64 pgp_num 64 last_change 242 flags
>> hashpspool stripe_width 0
>> pool 5 'vmimage-backups' replicated size 3 min_size 2 crush_ruleset 0
>> object_hash rjenkins pg_num 512 pgp_num 512 last_change 777 flags
>> hashpspool stripe_width 0
>>
>>
>> After adding 3 more osd, I see data is being replicated to the new osd
>> :) and near full osd warning is gone.
>>
>> recovery some hours ago:
>>
>>
>> >>  recovery 389973/3096070 objects degraded (12.596%)
>> >>  recovery 1258984/3096070 objects misplaced (40.664%)
>>
>> recovery now:
>>
>> recovery 8917/3217724 objects degraded (0.277%)
>> recovery 1120479/3217724 objects misplaced (34.822%)
>>
>>
>>
>>
>>
>> On Thu, Sep 1, 2016 at 4:13 PM, Christian Balzer  wrote:
>> >
>> > Hello,
>> >
>> > On Thu, 1 Sep 2016 14:00:53 +0200 Ishmael Tsoaela wrote:
>> >
>> >> more questions and I hope you don't mind:
>> >>
>> >>
>> >>
>> >> My understanding is that if I have 3 hosts with 5 osd each, 1 host
>> >> goes down, Ceph should not replicate to the osd that are down.
>> >>
>> > How could it replicate to something that is down?
>> >
>> >> When the host comes up, only then the replication will commence right?
>> >>
>> > Depends on your configuration.
>> >
>> >> If only 1 osd out of 5 comes up, then only data meant for that osd
>> >> should be copied to the osd? if so then why do pg get full if they
>> >> were not full before osd went down?
>> >>
>> > Definitely not.
>> >
>> >>
>> > You need to understand how  CRUSH maps, rules and replication work.
>> >
>> > By default pools with Hammer and higher will have a replicate size
>> > of 3 and CRUSH picks OSDs based on a host failure domain, so that's why
>> > you need at least 3 hosts with those default settings.

Re: [ceph-users] cephfs page cache

2016-09-01 Thread Yan, Zheng
I think about this again. This issue could be caused by stale session.
Could you check kernel logs of your servers. Are there any ceph
related kernel message (such as "ceph: mds0 caps stale")
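
A quick way to check, assuming standard kernel logging on the servers, would
be something like:

  dmesg -T | grep -i ceph
  grep -i ceph /var/log/kern.log    # or /var/log/messages, depending on distro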

Regards
Yan, Zheng


On Thu, Sep 1, 2016 at 11:02 PM, Sean Redmond  wrote:
> Hi,
>
> It seems to be using the mmap() syscall; from what I read, this indicates it is
> using memory-mapped IO.
>
> Please see a strace here: http://pastebin.com/6wjhSNrP
>
> Thanks
>
> On Wed, Aug 31, 2016 at 5:51 PM, Sean Redmond 
> wrote:
>>
>> I am not sure how to tell?
>>
>> Server1 and Server2 mount the ceph file system using kernel client 4.7.2
>> and I can replicate the problem using '/usr/bin/sum' to read the file or a
>> http GET request via a web server (apache).
>>
>> On Wed, Aug 31, 2016 at 2:38 PM, Yan, Zheng  wrote:
>>>
>>> On Wed, Aug 31, 2016 at 12:49 AM, Sean Redmond 
>>> wrote:
>>> > Hi,
>>> >
>>> > I have been able to pick through the process a little further and
>>> > replicate
>>> > it via the command line. The flow looks like this:
>>> >
>>> > 1) The user uploads an image to webserver server 'uploader01' it gets
>>> > written to a path such as
>>> > '/cephfs/webdata/static/456/JHL/66448H-755h.jpg'
>>> > on cephfs
>>> >
>>> > 2) The MDS makes the file meta data available for this new file
>>> > immediately
>>> > to all clients.
>>> >
>>> > 3) The 'uploader01' server asynchronously commits the file contents to
>>> > disk
>>> > as sync is not explicitly called during the upload.
>>> >
>>> > 4) Before step 3 is done the visitor requests the file via one of two
>>> > web
>>> > servers server1 or server2 - the MDS provides the meta data but the
>>> > contents
>>> > of the file is not committed to disk yet so the data read returns 0's -
>>> > This
>>> > is then cached by the file system page cache until it expires or is
>>> > flushed
>>> > manually.
>>>
>>> do server1 or server2 use memory-mapped IO to read the file?
>>>
>>> Regards
>>> Yan, Zheng
>>>
>>> >
>>> > 5) As step 4 typically only happens on one of the two web servers
>>> > before
>>> > step 3 is complete we get the mismatch between server1 and server2 file
>>> > system page cache.
>>> >
>>> > The below demonstrates how to reproduce this issue
>>> >
>>> > http://pastebin.com/QK8AemAb
>>> >
>>> > As we can see the checksum of the file returned by the web server is 0
>>> > as
>>> > the file contents has not been flushed to disk from server uploader01
>>> >
>>> > If however we call ‘sync’ as shown below the checksum is correct:
>>> >
>>> > http://pastebin.com/p4CfhEFt
>>> >
>>> > If we also wait for 10 seconds for the kernel to flush the dirty pages,
>>> > we
>>> > can also see the checksum is valid:
>>> >
>>> > http://pastebin.com/1w6UZzNQ
>>> >
>>> > It looks like it may be a race between the time it takes the uploader01
>>> > server to
>>> > commit the file to the file system and the fast incoming read request
>>> > from
>>> > the visiting user to server1 or server2.
>>> >
>>> > Thanks
>>> >
>>> >
>>> > On Tue, Aug 30, 2016 at 10:21 AM, Sean Redmond
>>> > 
>>> > wrote:
>>> >>
>>> >> You are correct it only seems to impact recently modified files.
>>> >>
>>> >> On Tue, Aug 30, 2016 at 3:36 AM, Yan, Zheng  wrote:
>>> >>>
>>> >>> On Tue, Aug 30, 2016 at 2:11 AM, Gregory Farnum 
>>> >>> wrote:
>>> >>> > On Mon, Aug 29, 2016 at 7:14 AM, Sean Redmond
>>> >>> > 
>>> >>> > wrote:
>>> >>> >> Hi,
>>> >>> >>
>>> >>> >> I am running cephfs (10.2.2) with kernel 4.7.0-1. I have noticed
>>> >>> >> that
>>> >>> >> frequently static files are showing empty when serviced via a web
>>> >>> >> server
>>> >>> >> (apache). I have tracked this down further and can see when
>>> >>> >> running a
>>> >>> >> checksum against the file on the cephfs file system on the node
>>> >>> >> serving the
>>> >>> >> empty http response the checksum is '0'
>>> >>> >>
>>> >>> >> The below shows the checksum on a defective node.
>>> >>> >>
>>> >>> >> [root@server2]# ls -al
>>> >>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
>>> >>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >>>
>>> >>> It seems this file was modified recently. Maybe the web server
>>> >>> silently modifies the files. Please check if this issue happens on
>>> >>> older files.
>>> >>>
>>> >>> Regards
>>> >>> Yan, Zheng
>>> >>>
>>> >>> >>
>>> >>> >> [root@server2]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>>> >>> >> 053
>>> >>> >
>>> >>> > So can we presume there are no file contents, and it's just 53
>>> >>> > blocks
>>> >>> > of zeros?
>>> >>> >
>>> >>> > This doesn't sound familiar to me; Zheng, do you have any ideas?
>>> >>> > Anyway, ceph-fuse shouldn't be susceptible to this bug even with
>>> >>> > the
>> >>> > page cache enabled; if you're just serving stuff via the web it's
>> >>> > probably a better idea anyway (harder to break, easier to update, etc).

Re: [ceph-users] Slow Request on OSD

2016-09-01 Thread Dan Jakubiec
Thanks you for all the help Wido:

> On Sep 1, 2016, at 14:03, Wido den Hollander  wrote:
> 
> You have to mark those OSDs as lost and also force create the incomplete PGs.
> 

This might be the root of our problems.  We didn't mark the parent OSD as 
"lost" before we removed it.  Now ceph won't let us mark it as lost (and it is 
no longer in the OSD tree):

djakubiec@dev:~$ ceph osd lost 8 --yes-i-really-mean-it
osd.8 is not down or doesn't exist


djakubiec@dev:~$ ceph osd tree
ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 58.19960 root default
-2  7.27489 host node24
 1  7.27489     osd.1    up  1.0  1.0
-3  7.27489 host node25
 2  7.27489     osd.2    up  1.0  1.0
-4  7.27489 host node26
 3  7.27489     osd.3    up  1.0  1.0
-5  7.27489 host node27
 4  7.27489     osd.4    up  1.0  1.0
-6  7.27489 host node28
 5  7.27489     osd.5    up  1.0  1.0
-7  7.27489 host node29
 6  7.27489     osd.6    up  1.0  1.0
-8  7.27539 host node30
 9  7.27539     osd.9    up  1.0  1.0
-9  7.27489 host node31
 7  7.27489     osd.7    up  1.0  1.0

BUT, even though OSD 8 no longer exists I see still lots of references to OSD 8 
in various dumps and query's.

Interestingly, I do still see weird entries in the CRUSH map (should I do 
something about these?):

# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 device8
device 9 osd.9

I then tried on all 80 incomplete PGs:

ceph pg force_create_pg 

The 80 PGs moved to "creating" for a few minutes but then all went back to 
"incomplete".

Is there some way to force individual PGs to be marked as "lost"?

Thanks!

-- Dan


> But I think you have lost so many objects that the cluster is beyond a point 
> of repair honestly.
> 
> Wido
> 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph warning

2016-09-01 Thread Christian Balzer

Hello,

On Thu, 1 Sep 2016 16:24:28 +0200 Ishmael Tsoaela wrote:

> I did configure the following during my initial setup:
> 
> osd pool default size = 3
> 
Ah yes, so not this.
(though the default "rbd" pool that's initially created tended to ignore
that parameter and would default to 3 in any case)

In fact I remember now writing about this before, you're looking at CRUSH
in action and the corner cases of a small cluster.

What happened here is that when the OSDs of your 3rd node were gone (down
and out) CRUSH recalculated the locations of PGs based on the new reality
and started to move things around. 
And unlike with a larger cluster (4+ nodes) or a single OSD failure, it
did NOT remove the "old" data after the move, since your replication level
wasn't achievable.

So this is what filled up your OSDs, moving (copying really) PGs to their
newly calculated location while not deleting the data at the old location
afterwards.

As said before, you will want to set 
"mon_osd_down_out_subtree_limit = host"
at least until your cluster is n+1 sized (4 nodes or more).
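
As a sketch, assuming it goes into ceph.conf on the monitor nodes, that would
look something like:

  [mon]
      mon osd down out subtree limit = host

and it can usually be injected at runtime with something along the lines of
'ceph tell mon.* injectargs --mon_osd_down_out_subtree_limit=host'.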

Adding more OSDs (and keeping usage below 60% or so) would also achieve
this, but a 4th node would be more helpful performance wise.

Christian

> 
> 
> root@nodeC:/mnt/vmimages# ceph osd dump | grep "replicated size"
> pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 217 flags hashpspool
> stripe_width 0
> pool 4 'vmimages' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 64 pgp_num 64 last_change 242 flags
> hashpspool stripe_width 0
> pool 5 'vmimage-backups' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 512 pgp_num 512 last_change 777 flags
> hashpspool stripe_width 0
> 
> 
> After adding 3 more osd, I see data is being replicated to the new osd
> :) and near full osd warning is gone.
> 
> recovery some hours ago:
> 
> 
> >>  recovery 389973/3096070 objects degraded (12.596%)
> >>  recovery 1258984/3096070 objects misplaced (40.664%)
> 
> recovery now:
> 
> recovery 8917/3217724 objects degraded (0.277%)
> recovery 1120479/3217724 objects misplaced (34.822%)
> 
> 
> 
> 
> 
> On Thu, Sep 1, 2016 at 4:13 PM, Christian Balzer  wrote:
> >
> > Hello,
> >
> > On Thu, 1 Sep 2016 14:00:53 +0200 Ishmael Tsoaela wrote:
> >
>> more questions and I hope you don't mind:
> >>
> >>
> >>
> >> My understanding is that if I have 3 hosts with 5 osd each, 1 host
> >> goes down, Ceph should not replicate to the osd that are down.
> >>
> > How could it replicate to something that is down?
> >
> >> When the host comes up, only then the replication will commence right?
> >>
> > Depends on your configuration.
> >
> >> If only 1 osd out of 5 comes up, then only data meant for that osd
> >> should be copied to the osd? if so then why do pg get full if they
> >> were not full before osd went down?
> >>
> > Definitely not.
> >
> >>
> > You need to understand how  CRUSH maps, rules and replication work.
> >
> > By default pools with Hammer and higher will have a replicate size
> > of 3 and CRUSH picks OSDs based on a host failure domain, so that's why
> > you need at least 3 hosts with those default settings.
> >
> > So with these defaults Ceph would indeed have done nothing in a 3 node
> > cluster if one node had gone down.
> > It needs to put replicas on different nodes, but only 2 are available.
> >
> > However given what happened to your cluster it is obvious that your pools
> > have a replication size of 2 most likely.
> > Check with
> > ceph osd dump | grep "replicated size"
> >
> > In that case Ceph will try to recover and restore 2 replicas (original and
> > copy), resulting in what you're seeing.
> >
> > Christian
> >
> >>
> >> On Thu, Sep 1, 2016 at 1:29 PM, Ishmael Tsoaela  
> >> wrote:
> >> > Thank you again.
> >> >
> >> > I will add 3 more osd today and leave untouched, maybe over weekend.
> >> >
> >> > On Thu, Sep 1, 2016 at 1:16 PM, Christian Balzer  wrote:
> >> >>
> >> >> Hello,
> >> >>
> >> >> On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:
> >> >>
> >> >>> thanks for the response
> >> >>>
> >> >>>
> >> >>>
> >> >>> > You really will want to spend more time reading documentation and 
> >> >>> > this ML,
> >> >>> > as well as using google to (re-)search things.
> >> >>>
> >> >>>
> >> >>>  I did do some reading on the error but cannot understand why they do
> >> >>> not clear even after so long.
> >> >>>
> >> >>> > In your previous mail you already mentioned a 92% full OSD, that 
> >> >>> > should
> >> >>> > combined with the various "full" warnings have impressed on you the 
> >> >>> > need
> >> >>> > to address this issue.
> >> >>>
> >> >>> > When your nodes all rebooted, did everything come back up?
> >> >>>
>> >>> One host with 5 osd was down and came up later.
> >> >>>
>> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in time?

Re: [ceph-users] vmware + iscsi + tgt + reservations

2016-09-01 Thread Brad Hubbard
On Fri, Sep 2, 2016 at 7:41 AM, Oliver Dzombic  wrote:
> Hi,
>
> I know, this is not really ceph related anymore, but I guess it could be
> helpful for others too.
>
> I was using:
>
> https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
>
> and I am currently running into a problem, where
>
> ONE LUN
>
> is connected to
>
> TWO Nodes ( esxi 6.0 )

What filesystem are you using on the LUN?

>
> And the 2nd node is unable to do any kind of write operations on the
> (successfully mounted, and readable) lun.

Depending on the filesystem you just corrupted it by mounting it
concurrently on two hosts.

>
> As it seems, it has to do with reservations.
>
> So the question is now, how to solve that.
>
> The vmware log says:
>
> 2016-09-01T21:09:54.281Z cpu18:33538)NMP: nmp_PathDetermineFailure:3002:
> SCSI cmd RESERVE failed on path vmhba37:C0:T0:L1, reservation state on
> device naa.6e010001 is unknown.
>
>
> tgtd --version
> 1.0.55
>
> Any help / idea is appreciated !
>
> Thank you !
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] vmware + iscsi + tgt + reservations

2016-09-01 Thread Oliver Dzombic
Hi,

I know, this is not really ceph related anymore, but I guess it could be
helpful for others too.

I was using:

https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/

and I am currently running into a problem, where

ONE LUN

is connected to

TWO Nodes ( esxi 6.0 )

And the 2nd node is unable to do any kind of write operations on the
(successfully mounted, and readable) lun.

As it seems, it has to do with reservations.

So the question is now, how to solve that.

The vmware log says:

2016-09-01T21:09:54.281Z cpu18:33538)NMP: nmp_PathDetermineFailure:3002:
SCSI cmd RESERVE failed on path vmhba37:C0:T0:L1, reservation state on
device naa.6e010001 is unknown.


tgtd --version
1.0.55

Any help / idea is appreciated !

Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange copy errors in osd log

2016-09-01 Thread Samuel Just
If it's bluestore, this is pretty likely to be a bluestore bug.  If
you are interested in experimenting with bluestore, you probably want
to watch developments on the master branch; it's undergoing a bunch
of changes right now.
-Sam

On Thu, Sep 1, 2016 at 1:54 PM, Виталий Филиппов  wrote:
> Hi! I'm playing with a test setup of ceph jewel with bluestore and cephfs
> over erasure-coded pool with replicated pool as a cache tier. After writing
> some number of small files to cephfs I begin seeing the following error
> messages during the migration of data from cache to EC pool:
>
> 2016-09-01 10:19:27.364710 7f37c1a09700 -1 osd.0 pg_epoch: 329 pg[6.2cs0( v
> 329'388 (0'0,329'388] local-les=315 n=326 ec=279 les/c/f 315/315/0
> 314/314/314) [0,1,2] r=0 lpr=314 crt=329'387 lcod 329'387 mlcod 329'387
> active+clean] process_copy_chunk data digest 0x648fd38c != source 0x40203b61
> 2016-09-01 10:19:27.364742 7f37c1a09700 -1 log_channel(cluster) log [ERR] :
> 6.2cs0 copy from 8:372dc315:::200.002b:head to
> 6:372dc315:::200.002b:head data digest 0x648fd38c != source 0x40203b61
>
> These messages then repeat infinitely for the same set of objects with some
> interval. I'm not sure - does this mean some objects are corrupted in OSDs?
> (how to check?) Is it a bug at all?
>
> P.S: I've also reported this as an issue:
> http://tracker.ceph.com/issues/17194 (not sure if it was right to do :))
>
> --
> With best regards,
>   Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Auto recovering after loosing all copies of a PG(s)

2016-09-01 Thread Wido den Hollander

> Op 1 september 2016 om 17:37 schreef Iain Buclaw :
> 
> 
> On 16 August 2016 at 17:13, Wido den Hollander  wrote:
> >
> >> Op 16 augustus 2016 om 15:59 schreef Iain Buclaw :
> >>
> >>
> >> The desired behaviour for me would be for the client to get an instant
> >> "not found" response from stat() operations.  For write() to recreate
> >> unfound objects.  And for missing placement groups to be recreated on
> >> an OSD that isn't overloaded.  Halting the entire cluster when 96% of
> >> it can still be accessed is just not workable, I'm afraid.
> >>
> >
> > Well, you can't make Ceph do that, but you can make librados do such a 
> > thing.
> >
> > I'm using the OSD and MON timeout settings in libvirt for example: 
> > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
> >
> > You can set these options:
> > - client_mount_timeout
> > - rados_mon_op_timeout
> > - rados_osd_op_timeout
> >
> > Where I think only the last two should be sufficient in your case.
> >
> You will get ETIMEDOUT back as an error when an operation times out.
> >
> > Wido
> >
> 
> This seems to be fine.
> 
> Now what to do when a DR situation happens.
> 
> 
>   pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
> 2485 GB used, 10691 GB / 13263 GB avail
> 3902 active+clean
>  128 creating
>   66 incomplete
> 
> 
> These PGs just never seem to finish creating.
> 

I have seen that happen as well, you sometimes need to restart the OSDs to let 
the create finish.
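
On a systemd-based install that would be something like the following, per
OSD (the id is just a placeholder):

  systemctl restart ceph-osd@12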

Wido

> -- 
> Iain Buclaw
> 
> *(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange copy errors in osd log

2016-09-01 Thread Виталий Филиппов
Hi! I'm playing with a test setup of ceph jewel with bluestore and cephfs  
over erasure-coded pool with replicated pool as a cache tier. After  
writing some number of small files to cephfs I begin seeing the following  
error messages during the migration of data from cache to EC pool:


2016-09-01 10:19:27.364710 7f37c1a09700 -1 osd.0 pg_epoch: 329 pg[6.2cs0(  
v 329'388 (0'0,329'388] local-les=315 n=326 ec=279 les/c/f 315/315/0  
314/314/314) [0,1,2] r=0 lpr=314 crt=329'387 lcod 329'387 mlcod 329'387  
active+clean] process_copy_chunk data digest 0x648fd38c != source  
0x40203b61
2016-09-01 10:19:27.364742 7f37c1a09700 -1 log_channel(cluster) log [ERR]  
: 6.2cs0 copy from 8:372dc315:::200.002b:head to  
6:372dc315:::200.002b:head data digest 0x648fd38c != source 0x40203b61


These messages then repeat infinitely for the same set of objects with  
some interval. I'm not sure - does this mean some objects are corrupted in  
OSDs? (how to check?) Is it a bug at all?
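
One way to check, assuming 6.2c is the pg id from the log above, might be to
deep-scrub the PG and then list what the scrub found:

  ceph pg deep-scrub 6.2c
  rados list-inconsistent-obj 6.2c --format=json-pretty   # needs a fresh scrub to report anything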


P.S: I've also reported this as an issue:  
http://tracker.ceph.com/issues/17194 (not sure if it was right to do :))


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow Request on OSD

2016-09-01 Thread Dan Jakubiec
Thanks Wido.  Reed and I have been working together to try to restore this 
cluster for about 3 weeks now.  I have been accumulating a number of failure 
modes that I am hoping to share with the Ceph group soon, but have been holding 
off a bit until we see the full picture clearly so that we can provide some 
succinct observations.

We know that losing 6 of 8 OSDs was definitely going to result in data loss, so 
I think we are resigned to that.  What has been difficult for us is that there 
have been many steps in the rebuild process that seem to get stuck and need our 
intervention.  But it is not 100% obvious which interventions we should be applying.

My very over-simplified hope was this:

1. We would remove the corrupted OSDs from the cluster
2. We would replace them with new OSDs
3. Ceph would figure out that a lot of PGs were lost
4. We would "agree and say okay -- lose the objects/files"
5. The cluster would use what remains and return to working state

I feel we have done something wrong along the way, and at this point we are 
trying to figure out how to do step #4 completely.  We are about to follow the 
steps to "mark unfound lost", which makes sense to me... but I'm not sure what 
to do about all the other inconsistencies.
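
For reference, the per-PG steps we are looking at are roughly the following
(the pg id is only a placeholder):

  ceph health detail | grep unfound       # find the PGs reporting unfound objects
  ceph pg 1.2f list_missing               # inspect the unfound objects in one PG
  ceph pg 1.2f mark_unfound_lost revert   # or "delete" if no usable copy remains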

What procedure do we need to follow to just tell Ceph "those PGs are lost, 
let's move on"?

===

A very quick history of what we did to get here:

1. 8 OSDs lost power simultaneously.
2. 2 OSDs came back without issues.
3. 1 OSD wouldn't start (various assertion failures), but we were able to copy
   its PGs to a new OSD as follows:
   3.1. ceph-objectstore-tool "export"
   3.2. ceph osd crush rm osd.N
   3.3. ceph auth del osd.N
   3.4. ceph osd rm osd.N
   3.5. Create new OSD from scratch (it got a new OSD ID)
   3.6. ceph-objectstore-tool "import"
4. The remaining 5 OSDs were corrupt beyond repair (could not export, mostly
   due to missing leveldb files after xfs_repair).  We redeployed them as follows:
   4.1. ceph osd crush rm osd.N
   4.2. ceph auth del osd.N
   4.3. ceph osd rm osd.N
   4.4. Create new OSD from scratch (it got the same OSD ID as the old OSD)

All the new OSDs from #4.4 ended up getting the same OSD ID as the original 
OSD.  Don't know if that is part of the problem?  It seems like doing the 
"crush rm" should have advised the cluster correctly, but perhaps not?

Where did we go wrong in the recovery process?

Thank you!

-- Dan

> On Sep 1, 2016, at 00:18, Wido den Hollander  wrote:
> 
> 
>> Op 31 augustus 2016 om 23:21 schreef Reed Dier :
>> 
>> 
>> Multiple XFS corruptions, multiple leveldb issues. Looked to be the result of 
>> write cache settings which have been adjusted now.
>> 
> 
> That is bad news, really bad.
> 
>> You’ll see below that there are tons of PG’s in bad states, and it was 
>> slowly but surely bringing the number of bad PGs down, but it seems to have 
>> hit a brick wall with this one slow request operation.
>> 
> 
> No, you have more issues. You have 17 PGs which are incomplete, a few 
> down+incomplete.
> 
> Without those PGs functioning (active+X) your MDS will probably not work.
> 
> Take a look at: 
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> 
> Make sure you go to HEALTH_WARN at first, in HEALTH_ERR the MDS will never 
> come online.
> 
> Wido
> 
>>> ceph -s
>>> cluster []
>>> health HEALTH_ERR
>>>292 pgs are stuck inactive for more than 300 seconds
>>>142 pgs backfill_wait
>>>135 pgs degraded
>>>63 pgs down
>>>80 pgs incomplete
>>>199 pgs inconsistent
>>>2 pgs recovering
>>>5 pgs recovery_wait
>>>1 pgs repair
>>>132 pgs stale
>>>160 pgs stuck inactive
>>>132 pgs stuck stale
>>>71 pgs stuck unclean
>>>128 pgs undersized
>>>1 requests are blocked > 32 sec
>>>recovery 5301381/46255447 objects degraded (11.461%)
>>>recovery 6335505/46255447 objects misplaced (13.697%)
>>>recovery 131/20781800 unfound (0.001%)
>>>14943 scrub errors
>>>mds cluster is degraded
>>> monmap e1: 3 mons at {core=[]:6789/0,db=[]:6789/0,dev=[]:6789/0}
>>>election epoch 262, quorum 0,1,2 core,dev,db
>>>  fsmap e3627: 1/1/1 up {0=core=up:replay}
>>> osdmap e3685: 8 osds: 8 up, 8 in; 153 remapped pgs
>>>flags sortbitwise
>>>  pgmap v1807138: 744 pgs, 10 pools, 7668 GB data, 20294 kobjects
>>>8998 GB used, 50598 GB / 59596 GB avail
>>>5301381/46255447 objects degraded (11.461%)
>>>6335505/46255447 objects misplaced (13.697%)
>>>131/20781800 unfound (0.001%)
>>> 209 active+clean
>>> 170 active+clean+inconsistent
>>> 112 stale+active+clean
>>>  74 undersized+degraded+remapped+wait_backfill+peered
>>>  63 down+incomplete
>>>  48 active+undersized+degraded+remapped+wait_backfill
>>> 

[ceph-users] CDM Reminder

2016-09-01 Thread Patrick McGarry
Hey cephers,

Just a reminder that this month’s Ceph Developer Monthly meeting will
be next Wed 07 Sep @ 9p EDT (it’s an APAC-friendly month). Please
submit your blueprints to:

http://wiki.ceph.com/CDM_07-SEP-2016

If you have any questions or concerns, please feel free to send them
my way. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Turn snapshot of a flattened snapshot into regular image

2016-09-01 Thread Steve Taylor
Something isn't right. Ceph won't delete RBDs that have existing snapshots, 
even when those snapshots aren't protected. You can't delete a snapshot that's 
protected, and you can't unprotect a snapshot if there is a COW clone that 
depends on it.

I'm not intimately familiar with OpenStack, but it must be deleting A without 
any snapshots. That would seem to indicate that at the point of deletion there 
are no COW clones of A or that any clone is no longer dependent on A. A COW 
clone requires a protected snapshot, a protected snapshot can't be deleted, and 
existing snapshots prevent RBDs from being deleted.
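
In rbd terms the chain of checks looks roughly like this (pool and image names
are placeholders only):

  rbd snap ls glance/imageA                # any snapshot here blocks 'rbd rm glance/imageA'
  rbd children glance/imageA@snap1         # clones that still depend on the snapshot
  rbd snap unprotect glance/imageA@snap1   # refused while children exist
  rbd flatten cinder/volumeA               # copies parent data so the clone no longer depends on it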

In my experience with OpenStack, booting a nova instance from a glance image 
causes a snapshot to be created, protected, and cloned on the RBD for the 
glance image. The clone becomes a cinder device that is then attached to the 
nova instance. Thus you're able to modify the contents of the volume within the 
instance. You wouldn't be able to delete the glance image at that point unless 
the cinder device were deleted first or it was flattened and no longer 
dependent on the glance image. I haven't performed this particular test. It's 
possible that OpenStack does the flattening for you in this scenario.

This issue will likely require some investigation at the RBD level throughout 
your testing process to understand exactly what's happening.




Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



-Original Message-
From: Eugen Block [mailto:ebl...@nde.ag]
Sent: Thursday, September 1, 2016 9:06 AM
To: Steve Taylor 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Turn snapshot of a flattened snapshot into regular 
image

Thanks for the quick response, but I don't believe I'm there yet ;-)

> cloned the glance image to a cinder device

I have configured these three services (nova, glance, cinder) to use ceph as 
storage backend, but cinder is not involved in this process I'm referring to.

Now I wanted to reproduce this scenario to show a colleague, and couldn't 
because now I was able to delete image A even with a non-flattened snapshot! 
How is that even possible?

Eugen



Zitat von Steve Taylor :

> You're already there. When you booted ONE you cloned the glance image
> to a cinder device (A', separate RBD) that was a COW clone of A.
> That's why you can't delete A until you flatten SNAP1. A' isn't a full
> copy until that flatten is complete, at which point you're able to
> delete A.
>
> SNAP2 is a second snapshot on A', and thus A' already has all of the
> data it needs from the previous flatten of SNAP1 to allow you to
> delete SNAP1. So SNAP2 isn't actually a full extra copy of the data.
>
>
> 
>
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799
>
> 
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with
> any attachments, and be advised that any dissemination or copying of
> this message is prohibited.
>
> 
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> Behalf Of Eugen Block
> Sent: Thursday, September 1, 2016 6:51 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Turn snapshot of a flattened snapshot into
> regular image
>
> Hi all,
>
> I'm trying to understand the idea behind rbd images and their
> clones/snapshots. I have tried this scenario:
>
> 1. upload image A to glance
> 2. boot instance ONE from image A
> 3. make changes to instance ONE (install new package)
> 4. create snapshot SNAP1 from ONE
> 5. delete instance ONE
> 6. delete image A
>    deleting image A fails because of existing snapshot SNAP1
> 7. flatten snapshot SNAP1
> 8. delete image A
>    succeeds
> 9. launch instance TWO from SNAP1
> 10. make changes to TWO (install package)
> 11. create snapshot SNAP2 from TWO
> 12. delete TWO
> 13. delete SNAP1
>     succeeds
>
> This means that the second snapshot has the same (full) size as the
> first. Can I manipulate SNAP1 somehow so that snapshots are not
> flattened anymore and SNAP2 becomes a cow clone of SNAP1?
>
> I hope my description is not too confusing.

Re: [ceph-users] Auto recovering after loosing all copies of a PG(s)

2016-09-01 Thread Iain Buclaw
On 16 August 2016 at 17:13, Wido den Hollander  wrote:
>
>> Op 16 augustus 2016 om 15:59 schreef Iain Buclaw :
>>
>>
>> The desired behaviour for me would be for the client to get an instant
>> "not found" response from stat() operations.  For write() to recreate
>> unfound objects.  And for missing placement groups to be recreated on
>> an OSD that isn't overloaded.  Halting the entire cluster when 96% of
>> it can still be accessed is just not workable, I'm afraid.
>>
>
> Well, you can't make Ceph do that, but you can make librados do such a thing.
>
> I'm using the OSD and MON timeout settings in libvirt for example: 
> http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
>
> You can set these options:
> - client_mount_timeout
> - rados_mon_op_timeout
> - rados_osd_op_timeout
>
> Where I think only the last two should be sufficient in your case.
>
> You will get ETIMEDOUT back as an error when an operation times out.
>
> Wido
>
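
For reference, in ceph.conf those live in the client section, something like
the following (the 30-second values are just an example):

  [client]
      client mount timeout = 30
      rados mon op timeout = 30
      rados osd op timeout = 30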

This seems to be fine.

Now what to do when a DR situation happens.


  pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
2485 GB used, 10691 GB / 13263 GB avail
3902 active+clean
 128 creating
  66 incomplete


These PGs just never seem to finish creating.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Turn snapshot of a flattened snapshot into regular image

2016-09-01 Thread Eugen Block

Thanks for the quick response, but I don't believe I'm there yet ;-)


cloned the glance image to a cinder device


I have configured these three services (nova, glance, cinder) to use  
ceph as storage backend, but cinder is not involved in this process  
I'm referring to.


Now I wanted to reproduce this scenario to show a colleague, and  
couldn't because now I was able to delete image A even with a  
non-flattened snapshot! How is that even possible?


Eugen



Zitat von Steve Taylor :

You're already there. When you booted ONE you cloned the glance  
image to a cinder device (A', separate RBD) that was a COW clone of  
A. That's why you can't delete A until you flatten SNAP1. A' isn't a  
full copy until that flatten is complete, at which point you're able  
to delete A.


SNAP2 is a second snapshot on A', and thus A' already has all of the  
data it needs from the previous flatten of SNAP1 to allow you to  
delete SNAP1. So SNAP2 isn't actually a full extra copy of the data.





Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation

380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799



If you are not the intended recipient of this message or received it  
erroneously, please notify the sender and delete it, together with  
any attachments, and be advised that any dissemination or copying of  
this message is prohibited.




-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On  
Behalf Of Eugen Block

Sent: Thursday, September 1, 2016 6:51 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Turn snapshot of a flattened snapshot into  
regular image


Hi all,

I'm trying to understand the idea behind rbd images and their  
clones/snapshots. I have tried this scenario:


1. upload image A to glance
2. boot instance ONE from image A
3. make changes to instance ONE (install new package)
4. create snapshot SNAP1 from ONE
5. delete instance ONE
6. delete image A
   deleting image A fails because of existing snapshot SNAP1
7. flatten snapshot SNAP1
8. delete image A
   succeeds
9. launch instance TWO from SNAP1
10. make changes to TWO (install package)
11. create snapshot SNAP2 from TWO
12. delete TWO
13. delete SNAP1
    succeeds

This means that the second snapshot has the same (full) size as the  
first. Can I manipulate SNAP1 somehow so that snapshots are not  
flattened anymore and SNAP2 becomes a cow clone of SNAP1?


I hope my description is not too confusing. The idea behind this  
question is, if I have one base image and want to adjust that image  
from time to time, I don't want to keep several versions of that  
image, I just want one. But this way i would lose the protection  
from deleting the base image.


Is there any config option in ceph or Openstack or anything else I  
can do to "un-flatten" an image? I would assume that there is some  
kind of flag set for that image. Maybe someone can point me to the  
right direction.


Thanks,
Eugen

--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs page cache

2016-09-01 Thread Sean Redmond
Hi,

It seems to be using the mmap() syscall; from what I read, this indicates it is
using memory-mapped IO.

Please see a strace here: http://pastebin.com/6wjhSNrP

Thanks

On Wed, Aug 31, 2016 at 5:51 PM, Sean Redmond 
wrote:

> I am not sure how to tell?
>
> Server1 and Server2 mount the ceph file system using kernel client 4.7.2
> and I can replicate the problem using '/usr/bin/sum' to read the file or a
> http GET request via a web server (apache).
>
> On Wed, Aug 31, 2016 at 2:38 PM, Yan, Zheng  wrote:
>
>> On Wed, Aug 31, 2016 at 12:49 AM, Sean Redmond 
>> wrote:
>> > Hi,
>> >
>> > I have been able to pick through the process a little further and
>> replicate
>> > it via the command line. The flow looks like this:
>> >
>> > 1) The user uploads an image to webserver server 'uploader01' it gets
>> > written to a path such as '/cephfs/webdata/static/456/JH
>> L/66448H-755h.jpg'
>> > on cephfs
>> >
>> > 2) The MDS makes the file meta data available for this new file
>> immediately
>> > to all clients.
>> >
>> > 3) The 'uploader01' server asynchronously commits the file contents to
>> disk
>> > as sync is not explicitly called during the upload.
>> >
>> > 4) Before step 3 is done the visitor requests the file via one of two
>> web
>> > servers server1 or server2 - the MDS provides the meta data but the
>> contents
>> > of the file is not committed to disk yet so the data read returns 0's -
>> This
>> > is then cached by the file system page cache until it expires or is
>> flushed
>> > manually.
>>
>> do server1 or server2 use memory-mapped IO to read the file?
>>
>> Regards
>> Yan, Zheng
>>
>> >
>> > 5) As step 4 typically only happens on one of the two web servers before
>> > step 3 is complete we get the mismatch between server1 and server2 file
>> > system page cache.
>> >
>> > The below demonstrates how to reproduce this issue
>> >
>> > http://pastebin.com/QK8AemAb
>> >
>> > As we can see the checksum of the file returned by the web server is 0
>> as
>> > the file contents has not been flushed to disk from server uploader01
>> >
>> > If however we call ‘sync’ as shown below the checksum is correct:
>> >
>> > http://pastebin.com/p4CfhEFt
>> >
>> > If we also wait for 10 seconds for the kernel to flush the dirty pages,
>> we
>> > can also see the checksum is valid:
>> >
>> > http://pastebin.com/1w6UZzNQ
>> >
>> > It looks like it may be a race between the time it takes the uploader01
>> server to
>> > commit the file to the file system and the fast incoming read request
>> from
>> > the visiting user to server1 or server2.
>> >
>> > Thanks
>> >
>> >
>> > On Tue, Aug 30, 2016 at 10:21 AM, Sean Redmond > >
>> > wrote:
>> >>
>> >> You are correct it only seems to impact recently modified files.
>> >>
>> >> On Tue, Aug 30, 2016 at 3:36 AM, Yan, Zheng  wrote:
>> >>>
>> >>> On Tue, Aug 30, 2016 at 2:11 AM, Gregory Farnum 
>> >>> wrote:
>> >>> > On Mon, Aug 29, 2016 at 7:14 AM, Sean Redmond <
>> sean.redmo...@gmail.com>
>> >>> > wrote:
>> >>> >> Hi,
>> >>> >>
>> >>> >> I am running cephfs (10.2.2) with kernel 4.7.0-1. I have noticed
>> that
>> >>> >> frequently static files are showing empty when serviced via a web
>> >>> >> server
>> >>> >> (apache). I have tracked this down further and can see when
>> running a
>> >>> >> checksum against the file on the cephfs file system on the node
>> >>> >> serving the
>> >>> >> empty http response the checksum is '0'
>> >>> >>
>> >>> >> The below shows the checksum on a defective node.
>> >>> >>
>> >>> >> [root@server2]# ls -al /cephfs/webdata/static/456/JHL
>> /66448H-755h.jpg
>> >>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
>> >>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>> >>>
>> >>> It seems this file was modified recently. Maybe the web server
>> >>> silently modifies the files. Please check if this issue happens on
>> >>> older files.
>> >>>
>> >>> Regards
>> >>> Yan, Zheng
>> >>>
>> >>> >>
>> >>> >> [root@server2]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>> >>> >> 0    53
>> >>> >
>> >>> > So can we presume there are no file contents, and it's just 53
>> blocks
>> >>> > of zeros?
>> >>> >
>> >>> > This doesn't sound familiar to me; Zheng, do you have any ideas?
>> >>> > Anyway, ceph-fuse shouldn't be susceptible to this bug even with the
>> >>> > page cache enabled; if you're just serving stuff via the web it's
>> >>> > probably a better idea anyway (harder to break, easier to update,
>> >>> > etc).
>> >>> > -Greg
>> >>> >
>> >>> >>
>> >>> >> The below shows the checksum on a working node.
>> >>> >>
>> >>> >> [root@server1]# ls -al /cephfs/webdata/static/456/JHL
>> /66448H-755h.jpg
>> >>> >> -rw-r--r-- 1 apache apache 53317 Aug 28 23:46
>> >>> >> /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>> >>> >>
>> >>> >> [root@server1]# sum /cephfs/webdata/static/456/JHL/66448H-755h.jpg
>> >>> >> 0362053
>> 

Re: [ceph-users] [Board] Ceph at OpenStack Barcelona

2016-09-01 Thread Dan Van Der Ster
Hi Patrick,

> On 01 Sep 2016, at 16:29, Patrick McGarry  wrote:
> 
> Hey cephers,
> 
> Now that our APAC roadshow has concluded I’m starting to look forward
> to upcoming events like OpenStack Barcelona. There were a ton of talks
> submitted this time around, so many of you did not get your talk
> accepted. You can see the 8 accepted talks here:
> 
> https://www.openstack.org/summit/barcelona-2016/summit-schedule/global-search?t=ceph
> 
> I am in the process of trying to put together an ancillary event on
> either Tues or Wed that week, but I am having trouble finding
> affordable space. If any of you would like to present at a side event

I have new material to present at a side event, and of course would be very 
interested to meet up with other Cephers at the summit.

Though, I don't think we can help with sponsorship :(

Cheers, Dan

> (and/or sponsor the effort) please let me know. I am working with the
> OpenStack foundation to see what we can get going. Thanks.
> 
> 
> -- 
> 
> Best Regards,
> 
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> Board mailing list
> bo...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/board-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph at OpenStack Barcelona

2016-09-01 Thread Patrick McGarry
Hey cephers,

Now that our APAC roadshow has concluded I’m starting to look forward
to upcoming events like OpenStack Barcelona. There were a ton of talks
submitted this time around, so many of you did not get your talk
accepted. You can see the 8 accepted talks here:

https://www.openstack.org/summit/barcelona-2016/summit-schedule/global-search?t=ceph

I am in the process of trying to put together an ancillary event on
either Tues or Wed that week, but I am having trouble finding
affordable space. If any of you would like to present at a side event
(and/or sponsor the effort) please let me know. I am working with the
OpenStack foundation to see what we can get going. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph warning

2016-09-01 Thread Ishmael Tsoaela
I did configure the following during my initial setup:

osd pool default size = 3



root@nodeC:/mnt/vmimages# ceph osd dump | grep "replicated size"
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 217 flags hashpspool
stripe_width 0
pool 4 'vmimages' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 64 pgp_num 64 last_change 242 flags
hashpspool stripe_width 0
pool 5 'vmimage-backups' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 512 pgp_num 512 last_change 777 flags
hashpspool stripe_width 0


After adding 3 more osd, I see data is being replicated to the new osd
:) and near full osd warning is gone.

recovery some hours ago:


>>  recovery 389973/3096070 objects degraded (12.596%)
>>  recovery 1258984/3096070 objects misplaced (40.664%)

recovery now:

recovery 8917/3217724 objects degraded (0.277%)
recovery 1120479/3217724 objects misplaced (34.822%)
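
For anyone following along, progress like this can be watched with, e.g.:

  ceph -w              # streams health/recovery events as they happen
  ceph health detail   # lists any remaining near-full OSDs and stuck PGs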





On Thu, Sep 1, 2016 at 4:13 PM, Christian Balzer  wrote:
>
> Hello,
>
> On Thu, 1 Sep 2016 14:00:53 +0200 Ishmael Tsoaela wrote:
>
>> more questions and I hope you don't mind:
>>
>>
>>
>> My understanding is that if I have 3 hosts with 5 osd each, 1 host
>> goes down, Ceph should not replicate to the osd that are down.
>>
> How could it replicate to something that is down?
>
>> When the host comes up, only then the replication will commence right?
>>
> Depends on your configuration.
>
>> If only 1 osd out of 5 comes up, then only data meant for that osd
>> should be copied to the osd? if so then why do pg get full if they
>> were not full before osd went down?
>>
> Definitely not.
>
>>
> You need to understand how  CRUSH maps, rules and replication work.
>
> By default pools with Hammer and higher will have a replicate size
> of 3 and CRUSH picks OSDs based on a host failure domain, so that's why
> you need at least 3 hosts with those default settings.
>
> So with these defaults Ceph would indeed have done nothing in a 3 node
> cluster if one node had gone down.
> It needs to put replicas on different nodes, but only 2 are available.
>
> However given what happened to your cluster it is obvious that your pools
> have a replication size of 2 most likely.
> Check with
> ceph osd dump | grep "replicated size"
>
> In that case Ceph will try to recover and restore 2 replicas (original and
> copy), resulting in what you're seeing.
>
> Christian
>
>>
>> On Thu, Sep 1, 2016 at 1:29 PM, Ishmael Tsoaela  wrote:
>> > Thank you again.
>> >
>> > I will add 3 more osd today and leave untouched, maybe over weekend.
>> >
>> > On Thu, Sep 1, 2016 at 1:16 PM, Christian Balzer  wrote:
>> >>
>> >> Hello,
>> >>
>> >> On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:
>> >>
>> >>> thanks for the response
>> >>>
>> >>>
>> >>>
>> >>> > You really will want to spend more time reading documentation and this 
>> >>> > ML,
>> >>> > as well as using google to (re-)search things.
>> >>>
>> >>>
>> >>>  I did do some reading on the error but cannot understand why they do
>> >>> not clear even after so long.
>> >>>
>> >>> > In your previous mail you already mentioned a 92% full OSD, that should
>> >>> > combined with the various "full" warnings have impressed on you the 
>> >>> > need
>> >>> > to address this issue.
>> >>>
>> >>> > When your nodes all rebooted, did everything come back up?
>> >>>
>> >>> One host with 5 osd was down and came up later.
>> >>>
>> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>> >>> time?
>> >>>
>> >>> about 7 hours
>> >>>
>> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>> >>> time?   about 7 hours
>> >>>
>> >> OK, so in that 7 hours (with 1/3rd of your cluster down), Ceph tried to
>> >> restore redundancy, but had not enough space to do so and got itself stuck
>> >> in a corner.
>> >>
>> >> Lesson here is:
>> >> a) have enough space to cover the loss of one node (rack, etc) or
>> >> b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
>> >> can recover a failed node before re-balancing starts.
>> >>
>> >> Of course b) assumes that you have 24/7 monitoring and access to your
>> >> cluster, so that restoring a failed node is likely faster that
>> >> re-balancing the data.
>> >>
>> >>
>> >>> True
>> >>>
>> >>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
>> >>> > full for that.
>> >>> > And until something changes it will be stuck there.
>> >>> > Your best bet is to add more OSDs, since you seem to be short on space
>> >>> > anyway. Or delete unneeded data.
>> >>> > Given your level of experience, I'd advice against playing with weights
>> >>> > and the respective "full" configuration options.
>> >>>
>> >>> I did reweights some osd but everything is back to normal. No config
>> >>> changes on "Full" config
>> >>>
>> >>> I deleted about 900G this morning and prepared 3 osd, should I add them now?

Re: [ceph-users] ceph warning

2016-09-01 Thread Christian Balzer

Hello,

On Thu, 1 Sep 2016 14:00:53 +0200 Ishmael Tsoaela wrote:

> more questions and I hope you don't mind:
> 
> 
> 
> My understanding is that if I have 3 hosts with 5 osd each, 1 host
> goes down, Ceph should not replicate to the osd that are down.
> 
How could it replicate to something that is down?

> When the host comes up, only then the replication will commence right?
>
Depends on your configuration.
 
> If only 1 osd out of 5 comes up, then only data meant for that osd
> should be copied to the osd? if so then why do pg get full if they
> were not full before osd went down?
>
Definitely not.
 
> 
You need to understand how  CRUSH maps, rules and replication work.

By default pools with Hammer and higher will have a replicate size
of 3 and CRUSH picks OSDs based on a host failure domain, so that's why
you need at least 3 hosts with those default settings.

So with these defaults Ceph would indeed have done nothing in a 3 node
cluster if one node had gone down.
It needs to put replicas on different nodes, but only 2 are available. 

However given what happened to your cluster it is obvious that your pools
have a replication size of 2 most likely.
Check with 
ceph osd dump | grep "replicated size"

In that case Ceph will try to recover and restore 2 replicas (original and
copy), resulting in what you're seeing.

Christian

> 
> On Thu, Sep 1, 2016 at 1:29 PM, Ishmael Tsoaela  wrote:
> > Thank you again.
> >
> > I will add 3 more osd today and leave untouched, maybe over weekend.
> >
> > On Thu, Sep 1, 2016 at 1:16 PM, Christian Balzer  wrote:
> >>
> >> Hello,
> >>
> >> On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:
> >>
> >>> thanks for the response
> >>>
> >>>
> >>>
> >>> > You really will want to spend more time reading documentation and this 
> >>> > ML,
> >>> > as well as using google to (re-)search things.
> >>>
> >>>
> >>>  I did do some reading on the error but cannot understand why they do
> >>> not clear even after so long.
> >>>
> >>> > In your previous mail you already mentioned a 92% full OSD, that should
> >>> > combined with the various "full" warnings have impressed on you the need
> >>> > to address this issue.
> >>>
> >>> > When your nodes all rebooted, did everything come back up?
> >>>
> >>> One host with 5 osd was down and came up later.
> >>>
> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> >>> time?
> >>>
> >>> about 7 hours
> >>>
> >>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> >>> time?   about 7 hours
> >>>
> >> OK, so in that 7 hours (with 1/3rd of your cluster down), Ceph tried to
> >> restore redundancy, but had not enough space to do so and got itself stuck
> >> in a corner.
> >>
> >> Lesson here is:
> >> a) have enough space to cover the loss of one node (rack, etc) or
> >> b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
> >> can recover a failed node before re-balancing starts.
> >>
> >> Of course b) assumes that you have 24/7 monitoring and access to your
> >> cluster, so that restoring a failed node is likely faster that
> >> re-balancing the data.
> >>
> >>
> >>> True
> >>>
> >>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
> >>> > full for that.
> >>> > And until something changes it will be stuck there.
> >>> > Your best bet is to add more OSDs, since you seem to be short on space
> >>> > anyway. Or delete unneeded data.
> >>> > Given your level of experience, I'd advice against playing with weights
> >>> > and the respective "full" configuration options.
> >>>
> >>> I did reweights some osd but everything is back to normal. No config
> >>> changes on "Full" config
> >>>
> >>> I deleted about 900G this morning and prepared 3 osd, should I add them 
> >>> now?
> >>>
> >> More OSDs will both make things less likely to get full again and give the
> >> nearfull OSDs a place to move data to.
> >>
> >> However they will also cause more data movement, so if your cluster is
> >> busy, maybe do that during the night or weekend.
> >>
> >>> > Are these numbers and the recovery io below still changing, moving 
> >>> > along?
> >>>
> >>> original email:
> >>>
> >>> > recovery 493335/3099981 objects degraded (15.914%)
> >>> > recovery 1377464/3099981 objects misplaced (44.435%)
> >>>
> >>>
> >>> current email:
> >>>
> >>>
> >>>  recovery 389973/3096070 objects degraded (12.596%)
> >>>  recovery 1258984/3096070 objects misplaced (40.664%)
> >>>
> >>>
> >> So there is progress, it may recover by itself after all.
> >>
> >> Looking at your "df" output only 7 OSDs seem to be nearfull now, is that
> >> correct?
> >>
> >> If so definitely progress, it's just taking a lot of time to recover.
> >>
> >> If the progress should stop before the cluster can get healthy again,
> >> write another mail with "ceph -s" and so forth for us to peruse.
> >>
> >> Christian
> >>
> >>> > Just to confirm, 

Re: [ceph-users] Turn snapshot of a flattened snapshot into regular image

2016-09-01 Thread Steve Taylor
You're already there. When you booted ONE you cloned the glance image to a 
cinder device (A', separate RBD) that was a COW clone of A. That's why you 
can't delete A until you flatten SNAP1. A' isn't a full copy until that flatten 
is complete, at which point you're able to delete A.

SNAP2 is a second snapshot on A', and thus A' already has all of the data it 
needs from the previous flatten of SNAP1 to allow you to delete SNAP1. So SNAP2 
isn't actually a full extra copy of the data.
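
If you want to see this for yourself, the parent/clone relationship is visible
from the rbd CLI. Pool, image and snapshot names below are placeholders, not
taken from your environment:

  # a COW clone still shows a "parent:" line here
  rbd info <pool>/<image>
  # list the clones hanging off a protected snapshot
  rbd children <pool>/<image>@<snapshot>
  # copy the remaining parent data into the clone and detach it
  rbd flatten <pool>/<image>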




Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eugen 
Block
Sent: Thursday, September 1, 2016 6:51 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Turn snapshot of a flattened snapshot into regular image

Hi all,

I'm trying to understand the idea behind rbd images and their clones/snapshots. 
I have tried this scenario:

1. upload image A to glance
2. boot instance ONE from image A
3. make changes to instance ONE (install new package)
4. create snapshot SNAP1 from ONE
5. delete instance ONE
6. delete image A
   deleting image A fails because of existing snapshot SNAP1
7. flatten snapshot SNAP1
8. delete image A
   succeeds
9. launch instance TWO from SNAP1
10. make changes to TWO (install package)
11. create snapshot SNAP2 from TWO
12. delete TWO
13. delete SNAP1
succeeds

This means that the second snapshot has the same (full) size as the first. Can 
I manipulate SNAP1 somehow so that snapshots are not flattened anymore and 
SNAP2 becomes a cow clone of SNAP1?

I hope my description is not too confusing. The idea behind this question is, 
if I have one base image and want to adjust that image from time to time, I 
don't want to keep several versions of that image, I just want one. But this 
way I would lose the protection from deleting the base image.

Is there any config option in ceph or Openstack or anything else I can do to 
"un-flatten" an image? I would assume that there is some kind of flag set for 
that image. Maybe someone can point me in the right direction.

Thanks,
Eugen

--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Turn snapshot of a flattened snapshot into regular image

2016-09-01 Thread Eugen Block

Hi all,

I'm trying to understand the idea behind rbd images and their  
clones/snapshots. I have tried this scenario:


1. upload image A to glance
2. boot instance ONE from image A
3. make changes to instance ONE (install new package)
4. create snapshot SNAP1 from ONE
5. delete instance ONE
6. delete image A
   deleting image A fails because of existing snapshot SNAP1
7. flatten snapshot SNAP1
8. delete image A
   succeeds
9. launch instance TWO from SNAP1
10. make changes to TWO (install package)
11. create snapshot SNAP2 from TWO
12. delete TWO
13. delete SNAP1
succeeds

This means that the second snapshot has the same (full) size as the  
first. Can I manipulate SNAP1 somehow so that snapshots are not  
flattened anymore and SNAP2 becomes a cow clone of SNAP1?


I hope my description is not too confusing. The idea behind this  
question is, if I have one base image and want to adjust that image  
from time to time, I don't want to keep several versions of that  
image, I just want one. But this way I would lose the protection from
deleting the base image.


Is there any config option in ceph or Openstack or anything else I can  
do to "un-flatten" an image? I would assume that there is some kind of  
flag set for that image. Maybe someone can point me in the right
direction.


Thanks,
Eugen

--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph warning

2016-09-01 Thread Ishmael Tsoaela
More questions, and I hope you don't mind:



My understanding is that if I have 3 hosts with 5 OSDs each and 1 host
goes down, Ceph should not replicate to the OSDs that are down.

When the host comes back up, only then will the replication commence, right?

If only 1 OSD out of 5 comes up, then only the data meant for that OSD
should be copied to it? If so, why do PGs get full if they
were not full before the OSD went down?



On Thu, Sep 1, 2016 at 1:29 PM, Ishmael Tsoaela  wrote:
> Thank you again.
>
> I will add 3 more osd today and leave untouched, maybe over weekend.
>
> On Thu, Sep 1, 2016 at 1:16 PM, Christian Balzer  wrote:
>>
>> Hello,
>>
>> On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:
>>
>>> thanks for the response
>>>
>>>
>>>
>>> > You really will want to spend more time reading documentation and this ML,
>>> > as well as using google to (re-)search things.
>>>
>>>
>>>  I did do some reading on the error but cannot understand why they do
>>> not clear even after so long.
>>>
>>> > In your previous mail you already mentioned a 92% full OSD, that should
>>> > combined with the various "full" warnings have impressed on you the need
>>> > to address this issue.
>>>
>>> > When your nodes all rebooted, did everything come back up?
>>>
> >>> One host with 5 OSDs was down and came up later.
>>>
>>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>>> time?
>>>
>>> about 7 hours
>>>
>>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>>> time?   about 7 hours
>>>
>> OK, so in that 7 hours (with 1/3rd of your cluster down), Ceph tried to
>> restore redundancy, but had not enough space to do so and got itself stuck
>> in a corner.
>>
>> Lesson here is:
>> a) have enough space to cover the loss of one node (rack, etc) or
>> b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
>> can recover a failed node before re-balancing starts.
>>
>> Of course b) assumes that you have 24/7 monitoring and access to your
> >> cluster, so that restoring a failed node is likely faster than
>> re-balancing the data.
>>
>>
>>> True
>>>
>>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
>>> > full for that.
>>> > And until something changes it will be stuck there.
>>> > Your best bet is to add more OSDs, since you seem to be short on space
>>> > anyway. Or delete unneeded data.
> >>> > Given your level of experience, I'd advise against playing with weights
>>> > and the respective "full" configuration options.
>>>
> >>> I did reweight some OSDs but everything is back to normal. No
> >>> changes to the "full" config options.
>>>
>>> I deleted about 900G this morning and prepared 3 osd, should I add them now?
>>>
>> More OSDs will both make things less likely to get full again and give the
>> nearfull OSDs a place to move data to.
>>
>> However they will also cause more data movement, so if your cluster is
>> busy, maybe do that during the night or weekend.
>>
>>> > Are these numbers and the recovery io below still changing, moving along?
>>>
>>> original email:
>>>
>>> > recovery 493335/3099981 objects degraded (15.914%)
>>> > recovery 1377464/3099981 objects misplaced (44.435%)
>>>
>>>
>>> current email:
>>>
>>>
>>>  recovery 389973/3096070 objects degraded (12.596%)
>>>  recovery 1258984/3096070 objects misplaced (40.664%)
>>>
>>>
>> So there is progress, it may recover by itself after all.
>>
>> Looking at your "df" output only 7 OSDs seem to be nearfull now, is that
>> correct?
>>
>> If so definitely progress, it's just taking a lot of time to recover.
>>
>> If the progress should stop before the cluster can get healthy again,
>> write another mail with "ceph -s" and so forth for us to peruse.
>>
>> Christian
>>
>>> > Just to confirm, that's all the 15 OSDs your cluster ever had?
>>>
>>> yes
>>>
>>>
>>> > Output from "ceph osd df" and "ceph osd tree" please.
>>>
>>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL %USE  VAR  PGS
>>>  3 0.90868  1.0   930G   232G  698G 24.96 0.40 105
>>>  5 0.90868  1.0   930G   139G  791G 14.99 0.24 139
>>>  6 0.90868  1.0   930G 61830M  870G  6.49 0.10 138
>>>  0 0.90868  1.0   930G   304G  625G 32.76 0.53 128
>>>  2 0.90868  1.0   930G 24253M  906G  2.55 0.04 130
>>>  1 0.90868  1.0   930G   793G  137G 85.22 1.37 162
>>>  4 0.90868  1.0   930G   790G  140G 84.91 1.36 160
>>>  7 0.90868  1.0   930G   803G  127G 86.34 1.39 144
>>> 10 0.90868  1.0   930G   792G  138G 85.16 1.37 145
>>> 13 0.90868  1.0   930G   811G  119G 87.17 1.40 163
>>> 15 0.90869  1.0   930G   794G  136G 85.37 1.37 157
>>> 16 0.90869  1.0   930G   757G  172G 81.45 1.31 159
>>> 17 0.90868  1.0   930G   800G  129G 86.06 1.38 144
>>> 18 0.90869  1.0   930G   786G  144G 84.47 1.36 166
>>> 19 0.90868  1.0   930G   793G  137G 85.26 1.37 160
>>>   TOTAL 13958G  8683G 5274G 62.21
>>> MIN/MAX VAR: 0.04/1.40  STDDEV: 33.10
>>>

Re: [ceph-users] ceph warning

2016-09-01 Thread Ishmael Tsoaela
Thank you again.

I will add 3 more osd today and leave untouched, maybe over weekend.

On Thu, Sep 1, 2016 at 1:16 PM, Christian Balzer  wrote:
>
> Hello,
>
> On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:
>
>> thanks for the response
>>
>>
>>
>> > You really will want to spend more time reading documentation and this ML,
>> > as well as using google to (re-)search things.
>>
>>
>>  I did do some reading on the error but cannot understand why they do
>> not clear even after so long.
>>
>> > In your previous mail you already mentioned a 92% full OSD, that should
>> > combined with the various "full" warnings have impressed on you the need
>> > to address this issue.
>>
>> > When your nodes all rebooted, did everything come back up?
>>
> >> One host with 5 OSDs was down and came up later.
>>
>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>> time?
>>
>> about 7 hours
>>
>> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
>> time?   about 7 hours
>>
> OK, so in that 7 hours (with 1/3rd of your cluster down), Ceph tried to
> restore redundancy, but had not enough space to do so and got itself stuck
> in a corner.
>
> Lesson here is:
> a) have enough space to cover the loss of one node (rack, etc) or
> b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
> can recover a failed node before re-balancing starts.
>
> Of course b) assumes that you have 24/7 monitoring and access to your
> cluster, so that restoring a failed node is likely faster than
> re-balancing the data.
>
>
>> True
>>
>> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
>> > full for that.
>> > And until something changes it will be stuck there.
>> > Your best bet is to add more OSDs, since you seem to be short on space
>> > anyway. Or delete unneeded data.
> >> > Given your level of experience, I'd advise against playing with weights
>> > and the respective "full" configuration options.
>>
> >> I did reweight some OSDs but everything is back to normal. No
> >> changes to the "full" config options.
>>
>> I deleted about 900G this morning and prepared 3 osd, should I add them now?
>>
> More OSDs will both make things less likely to get full again and give the
> nearfull OSDs a place to move data to.
>
> However they will also cause more data movement, so if your cluster is
> busy, maybe do that during the night or weekend.
>
>> > Are these numbers and the recovery io below still changing, moving along?
>>
>> original email:
>>
>> > recovery 493335/3099981 objects degraded (15.914%)
>> > recovery 1377464/3099981 objects misplaced (44.435%)
>>
>>
>> current email:
>>
>>
>>  recovery 389973/3096070 objects degraded (12.596%)
>>  recovery 1258984/3096070 objects misplaced (40.664%)
>>
>>
> So there is progress, it may recover by itself after all.
>
> Looking at your "df" output only 7 OSDs seem to be nearfull now, is that
> correct?
>
> If so definitely progress, it's just taking a lot of time to recover.
>
> If the progress should stop before the cluster can get healthy again,
> write another mail with "ceph -s" and so forth for us to peruse.
>
> Christian
>
>> > Just to confirm, that's all the 15 OSDs your cluster ever had?
>>
>> yes
>>
>>
>> > Output from "ceph osd df" and "ceph osd tree" please.
>>
>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL %USE  VAR  PGS
>>  3 0.90868  1.0   930G   232G  698G 24.96 0.40 105
>>  5 0.90868  1.0   930G   139G  791G 14.99 0.24 139
>>  6 0.90868  1.0   930G 61830M  870G  6.49 0.10 138
>>  0 0.90868  1.0   930G   304G  625G 32.76 0.53 128
>>  2 0.90868  1.0   930G 24253M  906G  2.55 0.04 130
>>  1 0.90868  1.0   930G   793G  137G 85.22 1.37 162
>>  4 0.90868  1.0   930G   790G  140G 84.91 1.36 160
>>  7 0.90868  1.0   930G   803G  127G 86.34 1.39 144
>> 10 0.90868  1.0   930G   792G  138G 85.16 1.37 145
>> 13 0.90868  1.0   930G   811G  119G 87.17 1.40 163
>> 15 0.90869  1.0   930G   794G  136G 85.37 1.37 157
>> 16 0.90869  1.0   930G   757G  172G 81.45 1.31 159
>> 17 0.90868  1.0   930G   800G  129G 86.06 1.38 144
>> 18 0.90869  1.0   930G   786G  144G 84.47 1.36 166
>> 19 0.90868  1.0   930G   793G  137G 85.26 1.37 160
>>   TOTAL 13958G  8683G 5274G 62.21
>> MIN/MAX VAR: 0.04/1.40  STDDEV: 33.10
>>
>>
>>
>> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 13.63019 root default
>> -2  4.54338 host nodeB
>>  3  0.90868 osd.3   up  1.0  1.0
>>  5  0.90868 osd.5   up  1.0  1.0
>>  6  0.90868 osd.6   up  1.0  1.0
>>  0  0.90868 osd.0   up  1.0  1.0
>>  2  0.90868 osd.2   up  1.0  1.0
>> -3  4.54338 host nodeC
>>  1  0.90868 osd.1   up  1.0  1.0
>>  4  0.90868 osd.4   up  1.0  1.0
>>  7  0.90868   

Re: [ceph-users] Slow Request on OSD

2016-09-01 Thread Cloud List
On Thu, Sep 1, 2016 at 3:50 PM, Nick Fisk  wrote:

> > > On 31 August 2016 at 23:21, Reed Dier wrote:
> > >
> > >
> > > Multiple XFS corruptions, multiple leveldb issues. Looked to be result
> of write cache settings which have been adjusted now.
>
> Reed, I realise that you are probably very busy attempting recovery at the
> moment, but when things calm down, I think it would be very beneficial to
> the list if you could expand on what settings caused this to happen. It
> might just stop this happening to someone else in the future.
>

Agree with Nick. When things settle down and (hopefully) all the data is
recovered, I would appreciate it if Reed could share what kind of write cache
settings can cause this problem and what adjustment was made to prevent this
kind of problem from happening again.

Thank you.

-ip-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph warning

2016-09-01 Thread Christian Balzer

Hello,

On Thu, 1 Sep 2016 11:20:33 +0200 Ishmael Tsoaela wrote:

> thanks for the response
> 
> 
> 
> > You really will want to spend more time reading documentation and this ML,
> > as well as using google to (re-)search things.
> 
> 
>  I did do some reading on the error but cannot understand why they do
> not clear even after so long.
> 
> > In your previous mail you already mentioned a 92% full OSD, that should
> > combined with the various "full" warnings have impressed on you the need
> > to address this issue.
> 
> > When your nodes all rebooted, did everything come back up?
> 
> One host with 5 OSDs was down and came up later.
> 
> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> time?
> 
> about 7 hours
> 
> > And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> time?   about 7 hours
>
OK, so in that 7 hours (with 1/3rd of your cluster down), Ceph tried to
restore redundancy, but had not enough space to do so and got itself stuck
in a corner.

Lesson here is:
a) have enough space to cover the loss of one node (rack, etc) or
b) set "mon_osd_down_out_subtree_limit = host" in your case, so that you
can recover a failed node before re-balancing starts.

Of course b) assumes that you have 24/7 monitoring and access to your
cluster, so that restoring a failed node is likely faster than
re-balancing the data.
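
For reference, option b) is just a monitor setting; a sketch of how I would
set it (double-check the exact syntax against your release before relying on
it):

  # ceph.conf, [mon] section, then restart the mons
  mon_osd_down_out_subtree_limit = host

  # or inject it into the running monitors
  ceph tell mon.* injectargs '--mon_osd_down_out_subtree_limit=host'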

 
> True
> 
> > Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
> > full for that.
> > And until something changes it will be stuck there.
> > Your best bet is to add more OSDs, since you seem to be short on space
> > anyway. Or delete unneeded data.
> > Given your level of experience, I'd advise against playing with weights
> > and the respective "full" configuration options.
> 
> I did reweight some OSDs but everything is back to normal. No
> changes to the "full" config options.
> 
> I deleted about 900G this morning and prepared 3 osd, should I add them now?
> 
More OSDs will both make things less likely to get full again and give the
nearfull OSDs a place to move data to.

However they will also cause more data movement, so if your cluster is
busy, maybe do that during the night or weekend.
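
If you cannot wait for a quiet window, the usual way to soften the impact is
to throttle backfill and recovery while the new OSDs are being filled; a
sketch (the values are just conservative examples, tune to taste):

  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'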

> > Are these numbers and the recovery io below still changing, moving along?
> 
> original email:
> 
> > recovery 493335/3099981 objects degraded (15.914%)
> > recovery 1377464/3099981 objects misplaced (44.435%)
> 
> 
> current email:
> 
> 
>  recovery 389973/3096070 objects degraded (12.596%)
>  recovery 1258984/3096070 objects misplaced (40.664%)
> 
> 
So there is progress, it may recover by itself after all.

Looking at your "df" output only 7 OSDs seem to be nearfull now, is that
correct? 

If so definitely progress, it's just taking a lot of time to recover.

If the progress should stop before the cluster can get healthy again,
write another mail with "ceph -s" and so forth for us to peruse.

Christian

> > Just to confirm, that's all the 15 OSDs your cluster ever had?
> 
> yes
> 
> 
> > Output from "ceph osd df" and "ceph osd tree" please.
> 
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL %USE  VAR  PGS
>  3 0.90868  1.0   930G   232G  698G 24.96 0.40 105
>  5 0.90868  1.0   930G   139G  791G 14.99 0.24 139
>  6 0.90868  1.0   930G 61830M  870G  6.49 0.10 138
>  0 0.90868  1.0   930G   304G  625G 32.76 0.53 128
>  2 0.90868  1.0   930G 24253M  906G  2.55 0.04 130
>  1 0.90868  1.0   930G   793G  137G 85.22 1.37 162
>  4 0.90868  1.0   930G   790G  140G 84.91 1.36 160
>  7 0.90868  1.0   930G   803G  127G 86.34 1.39 144
> 10 0.90868  1.0   930G   792G  138G 85.16 1.37 145
> 13 0.90868  1.0   930G   811G  119G 87.17 1.40 163
> 15 0.90869  1.0   930G   794G  136G 85.37 1.37 157
> 16 0.90869  1.0   930G   757G  172G 81.45 1.31 159
> 17 0.90868  1.0   930G   800G  129G 86.06 1.38 144
> 18 0.90869  1.0   930G   786G  144G 84.47 1.36 166
> 19 0.90868  1.0   930G   793G  137G 85.26 1.37 160
>   TOTAL 13958G  8683G 5274G 62.21
> MIN/MAX VAR: 0.04/1.40  STDDEV: 33.10
> 
> 
> 
> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 13.63019 root default
> -2  4.54338 host nodeB
>  3  0.90868 osd.3   up  1.0  1.0
>  5  0.90868 osd.5   up  1.0  1.0
>  6  0.90868 osd.6   up  1.0  1.0
>  0  0.90868 osd.0   up  1.0  1.0
>  2  0.90868 osd.2   up  1.0  1.0
> -3  4.54338 host nodeC
>  1  0.90868 osd.1   up  1.0  1.0
>  4  0.90868 osd.4   up  1.0  1.0
>  7  0.90868 osd.7   up  1.0  1.0
> 10  0.90868 osd.10  up  1.0  1.0
> 13  0.90868 osd.13  up  1.0  1.0
> -6  4.54343 host nodeD
> 15  0.90869 osd.15  up  1.0  1.0
> 16  0.90869 

Re: [ceph-users] ceph journal system vs filesystem journal system

2016-09-01 Thread huang jun
2016-09-01 17:25 GMT+08:00 한승진 :
> Hi all.
>
> I'm very confused about the Ceph journal system.
>
> Some people said the Ceph journal system works like a Linux journaling filesystem.
>
> Others said all data is written to the journal first and then written to
> the OSD data store.
>
> Does the Ceph journal write just the metadata of an object, or all of
> the object's data?
>
> Which is right?
>

Data written to an OSD goes to the OSD journal first, via direct I/O (dio),
and is then submitted to the objectstore. This improves small-write
performance because the journal write is sequential rather than random, and
the journal can be replayed to recover data that was written to the journal
but not yet to the objectstore, e.g. after an outage.
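
If you want to check how the journal is set up on one of your OSDs, the admin
socket shows the relevant options (assuming the default socket path and osd.0
as an example):

  ceph daemon osd.0 config show | grep -E 'osd_journal|journal_dio'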

> Thanks for your help
>
> Best regards.
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Thank you!
HuangJun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph journal system vs filesystem journal system

2016-09-01 Thread 한승진
Hi all.

I'm very confused about the Ceph journal system.

Some people said the Ceph journal system works like a Linux journaling filesystem.

Others said all data is written to the journal first and then written
to the OSD data store.

Does the Ceph journal write just the metadata of an object, or all of
the object's data?

Which is right?

Thanks for your help

Best regards.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph warning

2016-09-01 Thread Ishmael Tsoaela
thanks for the response



> You really will want to spend more time reading documentation and this ML,
> as well as using google to (re-)search things.


 I did do some reading on the error but cannot understand why they do
not clear even after so long.

> In your previous mail you already mentioned a 92% full OSD, that should
> combined with the various "full" warnings have impressed on you the need
> to address this issue.

> When your nodes all rebooted, did everything come back up?

One host with 5 OSDs was down and came up later.

> And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
time?

about 7 hours

> And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
time?   about 7 hours

True

> Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
> full for that.
> And until something changes it will be stuck there.
> Your best bet is to add more OSDs, since you seem to be short on space
> anyway. Or delete unneeded data.
> Given your level of experience, I'd advise against playing with weights
> and the respective "full" configuration options.

I did reweight some OSDs but everything is back to normal. No
changes to the "full" config options.

I deleted about 900G this morning and prepared 3 osd, should I add them now?

> Are these numbers and the recovery io below still changing, moving along?

original email:

> recovery 493335/3099981 objects degraded (15.914%)
> recovery 1377464/3099981 objects misplaced (44.435%)


current email:


 recovery 389973/3096070 objects degraded (12.596%)
 recovery 1258984/3096070 objects misplaced (40.664%)


> Just to confirm, that's all the 15 OSDs your cluster ever had?

yes


> Output from "ceph osd df" and "ceph osd tree" please.

ID WEIGHT  REWEIGHT SIZE   USEAVAIL %USE  VAR  PGS
 3 0.90868  1.0   930G   232G  698G 24.96 0.40 105
 5 0.90868  1.0   930G   139G  791G 14.99 0.24 139
 6 0.90868  1.0   930G 61830M  870G  6.49 0.10 138
 0 0.90868  1.0   930G   304G  625G 32.76 0.53 128
 2 0.90868  1.0   930G 24253M  906G  2.55 0.04 130
 1 0.90868  1.0   930G   793G  137G 85.22 1.37 162
 4 0.90868  1.0   930G   790G  140G 84.91 1.36 160
 7 0.90868  1.0   930G   803G  127G 86.34 1.39 144
10 0.90868  1.0   930G   792G  138G 85.16 1.37 145
13 0.90868  1.0   930G   811G  119G 87.17 1.40 163
15 0.90869  1.0   930G   794G  136G 85.37 1.37 157
16 0.90869  1.0   930G   757G  172G 81.45 1.31 159
17 0.90868  1.0   930G   800G  129G 86.06 1.38 144
18 0.90869  1.0   930G   786G  144G 84.47 1.36 166
19 0.90868  1.0   930G   793G  137G 85.26 1.37 160
  TOTAL 13958G  8683G 5274G 62.21
MIN/MAX VAR: 0.04/1.40  STDDEV: 33.10



ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 13.63019 root default
-2  4.54338 host nodeB
 3  0.90868 osd.3   up  1.0  1.0
 5  0.90868 osd.5   up  1.0  1.0
 6  0.90868 osd.6   up  1.0  1.0
 0  0.90868 osd.0   up  1.0  1.0
 2  0.90868 osd.2   up  1.0  1.0
-3  4.54338 host nodeC
 1  0.90868 osd.1   up  1.0  1.0
 4  0.90868 osd.4   up  1.0  1.0
 7  0.90868 osd.7   up  1.0  1.0
10  0.90868 osd.10  up  1.0  1.0
13  0.90868 osd.13  up  1.0  1.0
-6  4.54343 host nodeD
15  0.90869 osd.15  up  1.0  1.0
16  0.90869 osd.16  up  1.0  1.0
17  0.90868 osd.17  up  1.0  1.0
18  0.90869 osd.18  up  1.0  1.0
19  0.90868 osd.19  up  1.0  1.0





On Thu, Sep 1, 2016 at 10:56 AM, Christian Balzer  wrote:
>
>
> Hello,
>
> On Thu, 1 Sep 2016 10:18:39 +0200 Ishmael Tsoaela wrote:
>
> > Hi All,
> >
> > Can someone please decipher these errors for me? After all nodes rebooted in
> > my cluster on Monday, the warning has not gone.
> >
> You really will want to spend more time reading documentation and this ML,
> as well as using google to (re-)search things.
> Like searching for "backfill_toofull", "near full", etc.
>
>
> > Will the warning ever clear?
> >
> Unlikely.
>
> In your previous mail you already mentioned a 92% full OSD, that should
> combined with the various "full" warnings have impressed on you the need
> to address this issue.
>
> When your nodes all rebooted, did everything come back up?
> And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
> time?
> > My guess is that some nodes/OSDs were restarted a lot later than others.
>
> See inline:
> >
> >   cluster df3f96d8-3889-4baa-8b27-cc2839141425
> >  health HEALTH_WARN
> > 2 pgs backfill_toofull
> Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
> 

[ceph-users] RadosGW zonegroup id error

2016-09-01 Thread Yoann Moulin
Hello,

I have an issue with the default zonegroup on my cluster (Jewel 10.2.2), I don't
know when this occurred, but I think I did a wrong command during the
manipulation of zones and regions. Now the ID of my zonegroup is "default"
instead of "4d982760-7853-4174-8c05-cec2ef148cf0", I cannot update zones or
regions anymore.

Is it possible to change the ID of the zonegroup? I tried to update the JSON
and then set the zonegroup, but it doesn't work (certainly because it's not the
same ID...).
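
For reference, what I tried was roughly the following (a sketch, the file name
is arbitrary):

  radosgw-admin zonegroup get > zonegroup.json
  # edit the "id" field in zonegroup.json by hand
  radosgw-admin zonegroup set < zonegroup.json
  radosgw-admin period update --commit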

see below the zonegroup and zonegroup-map metadata

$ radosgw-admin zonegroup get
{
"id": "default",
"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "",
"zones": [
{
"id": "default",
"name": "default",
"endpoints": [],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "ccc2e663-66d3-49a6-9e3a-f257785f2d9a"
}

$ radosgw-admin zonegroup-map get
{
"zonegroups": [
{
"key": "4d982760-7853-4174-8c05-cec2ef148cf0",
"val": {
"id": "4d982760-7853-4174-8c05-cec2ef148cf0",
"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "c9724aff-5fa0-4dd9-b494-57bdb48fab4e",
"zones": [
{
"id": "c9724aff-5fa0-4dd9-b494-57bdb48fab4e",
"name": "default",
"endpoints": [],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "ccc2e663-66d3-49a6-9e3a-f257785f2d9a"
}
}
],
"master_zonegroup": "4d982760-7853-4174-8c05-cec2ef148cf0",
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}

Thanks for your help,

Best regards

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph warning

2016-09-01 Thread Christian Balzer

Hello,

On Thu, 1 Sep 2016 10:18:39 +0200 Ishmael Tsoaela wrote:

> Hi All,
> 
> Can someone please decipher these errors for me? After all nodes rebooted in
> my cluster on Monday, the warning has not gone.
>
You really will want to spend more time reading documentation and this ML,
as well as using google to (re-)search things.
Like searching for "backfill_toofull", "near full", etc.


> Will the warning ever clear?
> 
Unlikely.

In your previous mail you already mentioned a 92% full OSD, that should
combined with the various "full" warnings have impressed on you the need
to address this issue.

When your nodes all rebooted, did everything come back up? 
And if so (as the 15 osds: 15 up, 15 in suggest), how much separated in
time? 
My guess is that some nodes/OSDs were restarted a lot later than others.

See inline:
> 
>   cluster df3f96d8-3889-4baa-8b27-cc2839141425
>  health HEALTH_WARN
> 2 pgs backfill_toofull
Bad, Ceph wants to place data onto these 2 PGs, but their OSDs are too
full for that.
And until something changes it will be stuck there.

Your best bet is to add more OSDs, since you seem to be short on space
anyway. Or delete unneeded data.
Given your level of experience, I'd advise against playing with weights
and the respective "full" configuration options.
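
To see exactly which OSDs are affected and how close they are to the limits
(without changing anything), something like this is enough; the nearfull/full
thresholds default to 0.85/0.95:

  # lists the near full OSDs with their utilisation
  ceph health detail
  # the ratios currently in effect (osd.0 used as an example daemon)
  ceph daemon osd.0 config show | grep full_ratio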

> 532 pgs backfill_wait
> 3 pgs backfilling
> 330 pgs degraded
> 537 pgs stuck unclean
> 330 pgs undersized
> recovery 493335/3099981 objects degraded (15.914%)
> recovery 1377464/3099981 objects misplaced (44.435%)
Are these numbers and the recovery io below still changing, moving along?

> 8 near full osd(s)
8 out of 15, definitely needs more OSD.
Output from "ceph osd df" and "ceph osd tree" please.

>  monmap e7: 3 mons at {Monitors}
> election epoch 118, quorum 0,1,2 nodeB,nodeC,nodeD
>  osdmap e3922: 15 osds: 15 up, 15 in; 537 remapped pgs

Just to confirm, that's all the 15 OSDs your cluster ever had?

Christian

> flags sortbitwise
>   pgmap v2431741: 640 pgs, 3 pools, 3338 GB data, 864 kobjects
> 8242 GB used, 5715 GB / 13958 GB avail
> 493335/3099981 objects degraded (15.914%)
> 1377464/3099981 objects misplaced (44.435%)
>  327 active+undersized+degraded+remapped+wait_backfill
>  205 active+remapped+wait_backfill
>  103 active+clean
>3 active+undersized+degraded+remapped+backfilling
>2 active+remapped+backfill_toofull
> recovery io 367 MB/s, 96 objects/s
>   client io 5699 B/s rd, 23749 B/s wr, 2 op/s rd, 12 op/s wr


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph warning

2016-09-01 Thread Ishmael Tsoaela
Hi All,

Can someone please decipher these errors for me? After all nodes rebooted in
my cluster on Monday, the warning has not gone.

Will the warning ever clear?


  cluster df3f96d8-3889-4baa-8b27-cc2839141425
 health HEALTH_WARN
2 pgs backfill_toofull
532 pgs backfill_wait
3 pgs backfilling
330 pgs degraded
537 pgs stuck unclean
330 pgs undersized
recovery 493335/3099981 objects degraded (15.914%)
recovery 1377464/3099981 objects misplaced (44.435%)
8 near full osd(s)
 monmap e7: 3 mons at {Monitors}
election epoch 118, quorum 0,1,2 nodeB,nodeC,nodeD
 osdmap e3922: 15 osds: 15 up, 15 in; 537 remapped pgs
flags sortbitwise
  pgmap v2431741: 640 pgs, 3 pools, 3338 GB data, 864 kobjects
8242 GB used, 5715 GB / 13958 GB avail
493335/3099981 objects degraded (15.914%)
1377464/3099981 objects misplaced (44.435%)
 327 active+undersized+degraded+remapped+wait_backfill
 205 active+remapped+wait_backfill
 103 active+clean
   3 active+undersized+degraded+remapped+backfilling
   2 active+remapped+backfill_toofull
recovery io 367 MB/s, 96 objects/s
  client io 5699 B/s rd, 23749 B/s wr, 2 op/s rd, 12 op/s wr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow Request on OSD

2016-09-01 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
> den Hollander
> Sent: 01 September 2016 08:19
> To: Reed Dier 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Slow Request on OSD
> 
> 
> > On 31 August 2016 at 23:21, Reed Dier wrote:
> >
> >
> > Multiple XFS corruptions, multiple leveldb issues. Looked to be result of 
> > write cache settings which have been adjusted now.

Reed, I realise that you are probably very busy attempting recovery at the 
moment, but when things calm down, I think it would be very beneficial to the 
list if you could expand on what settings caused this to happen. It might just 
stop this happening to someone else in the future.

> >
> 
> That is bad news, really bad.
> 
> > You’ll see below that there are tons of PG’s in bad states, and it was 
> > slowly but surely bringing the number of bad PGs down, but it
> seems to have hit a brick wall with this one slow request operation.
> >
> 
> No, you have more issues. You have 17 PGs which are incomplete, a few 
> down+incomplete.
> 
> Without those PGs functioning (active+X) your MDS will probably not work.
> 
> Take a look at: 
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> 
> Make sure you go to HEALTH_WARN at first, in HEALTH_ERR the MDS will never 
> come online.
> 
> Wido
> 
> > > ceph -s
> > > cluster []
> > >  health HEALTH_ERR
> > > 292 pgs are stuck inactive for more than 300 seconds
> > > 142 pgs backfill_wait
> > > 135 pgs degraded
> > > 63 pgs down
> > > 80 pgs incomplete
> > > 199 pgs inconsistent
> > > 2 pgs recovering
> > > 5 pgs recovery_wait
> > > 1 pgs repair
> > > 132 pgs stale
> > > 160 pgs stuck inactive
> > > 132 pgs stuck stale
> > > 71 pgs stuck unclean
> > > 128 pgs undersized
> > > 1 requests are blocked > 32 sec
> > > recovery 5301381/46255447 objects degraded (11.461%)
> > > recovery 6335505/46255447 objects misplaced (13.697%)
> > > recovery 131/20781800 unfound (0.001%)
> > > 14943 scrub errors
> > > mds cluster is degraded
> > >  monmap e1: 3 mons at {core=[]:6789/0,db=[]:6789/0,dev=[]:6789/0}
> > > election epoch 262, quorum 0,1,2 core,dev,db
> > >   fsmap e3627: 1/1/1 up {0=core=up:replay}
> > >  osdmap e3685: 8 osds: 8 up, 8 in; 153 remapped pgs
> > > flags sortbitwise
> > >   pgmap v1807138: 744 pgs, 10 pools, 7668 GB data, 20294 kobjects
> > > 8998 GB used, 50598 GB / 59596 GB avail
> > > 5301381/46255447 objects degraded (11.461%)
> > > 6335505/46255447 objects misplaced (13.697%)
> > > 131/20781800 unfound (0.001%)
> > >  209 active+clean
> > >  170 active+clean+inconsistent
> > >  112 stale+active+clean
> > >   74 undersized+degraded+remapped+wait_backfill+peered
> > >   63 down+incomplete
> > >   48 active+undersized+degraded+remapped+wait_backfill
> > >   19 stale+active+clean+inconsistent
> > >   17 incomplete
> > >   12 active+remapped+wait_backfill
> > >5 active+recovery_wait+degraded
> > >4 
> > > undersized+degraded+remapped+inconsistent+wait_backfill+peered
> > >4 active+remapped+inconsistent+wait_backfill
> > >2 active+recovering+degraded
> > >2 undersized+degraded+remapped+peered
> > >1 stale+active+clean+scrubbing+deep+inconsistent+repair
> > >1 active+clean+scrubbing+deep
> > >1 active+clean+scrubbing+inconsistent
> >
> >
> > Thanks,
> >
> > Reed
> >
> > > On Aug 31, 2016, at 4:08 PM, Wido den Hollander  wrote:
> > >
> > >>
> > >> On 31 August 2016 at 22:56, Reed Dier wrote:
> > >>
> > >>
> > >> After a power failure left our jewel cluster crippled, I have hit a 
> > >> sticking point in attempted recovery.
> > >>
> > >> Out of 8 osd’s, we likely lost 5-6, trying to salvage what we can.
> > >>
> > >
> > > That's probably too much. How do you mean lost? Is XFS crippled/corrupted? 
> > > That shouldn't happen.
> > >
> > >> In addition to rados pools, we were also using CephFS, and the 
> > >> cephfs.metadata and cephfs.data pools likely lost plenty of PG’s.
> > >>
> > >
> > > What is the status of all PGs? What does 'ceph -s' show?
> > >
> > > Are all PGs active? Since that's something which needs to be done first.
> > >
> > >> The mds has reported this ever since returning from the power loss:
> > >>> # ceph mds stat
> > >>> 

Re: [ceph-users] Slow Request on OSD

2016-09-01 Thread Wido den Hollander

> On 31 August 2016 at 23:21, Reed Dier wrote:
> 
> 
> Multiple XFS corruptions, multiple leveldb issues. Looked to be result of 
> write cache settings which have been adjusted now.
> 

That is bad news, really bad.

> You’ll see below that there are tons of PG’s in bad states, and it was slowly 
> but surely bringing the number of bad PGs down, but it seems to have hit a 
> brick wall with this one slow request operation.
> 

No, you have more issues. You have 17 PGs which are incomplete, a few 
down+incomplete.

Without those PGs functioning (active+X) your MDS will probably not work.

Take a look at: 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

Make sure you go to HEALTH_WARN at first, in HEALTH_ERR the MDS will never come 
online.
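
The usual starting point from that page is to see why each of those PGs is
stuck; <pgid> below is a placeholder for one of the incomplete PGs:

  ceph health detail
  ceph pg dump_stuck inactive
  # shows the peering state, e.g. which OSDs the PG is still waiting for
  ceph pg <pgid> query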

Wido

> > ceph -s
> > cluster []
> >  health HEALTH_ERR
> > 292 pgs are stuck inactive for more than 300 seconds
> > 142 pgs backfill_wait
> > 135 pgs degraded
> > 63 pgs down
> > 80 pgs incomplete
> > 199 pgs inconsistent
> > 2 pgs recovering
> > 5 pgs recovery_wait
> > 1 pgs repair
> > 132 pgs stale
> > 160 pgs stuck inactive
> > 132 pgs stuck stale
> > 71 pgs stuck unclean
> > 128 pgs undersized
> > 1 requests are blocked > 32 sec
> > recovery 5301381/46255447 objects degraded (11.461%)
> > recovery 6335505/46255447 objects misplaced (13.697%)
> > recovery 131/20781800 unfound (0.001%)
> > 14943 scrub errors
> > mds cluster is degraded
> >  monmap e1: 3 mons at {core=[]:6789/0,db=[]:6789/0,dev=[]:6789/0}
> > election epoch 262, quorum 0,1,2 core,dev,db
> >   fsmap e3627: 1/1/1 up {0=core=up:replay}
> >  osdmap e3685: 8 osds: 8 up, 8 in; 153 remapped pgs
> > flags sortbitwise
> >   pgmap v1807138: 744 pgs, 10 pools, 7668 GB data, 20294 kobjects
> > 8998 GB used, 50598 GB / 59596 GB avail
> > 5301381/46255447 objects degraded (11.461%)
> > 6335505/46255447 objects misplaced (13.697%)
> > 131/20781800 unfound (0.001%)
> >  209 active+clean
> >  170 active+clean+inconsistent
> >  112 stale+active+clean
> >   74 undersized+degraded+remapped+wait_backfill+peered
> >   63 down+incomplete
> >   48 active+undersized+degraded+remapped+wait_backfill
> >   19 stale+active+clean+inconsistent
> >   17 incomplete
> >   12 active+remapped+wait_backfill
> >5 active+recovery_wait+degraded
> >4 
> > undersized+degraded+remapped+inconsistent+wait_backfill+peered
> >4 active+remapped+inconsistent+wait_backfill
> >2 active+recovering+degraded
> >2 undersized+degraded+remapped+peered
> >1 stale+active+clean+scrubbing+deep+inconsistent+repair
> >1 active+clean+scrubbing+deep
> >1 active+clean+scrubbing+inconsistent
> 
> 
> Thanks,
> 
> Reed
> 
> > On Aug 31, 2016, at 4:08 PM, Wido den Hollander  wrote:
> > 
> >> 
> >> On 31 August 2016 at 22:56, Reed Dier wrote:
> >> 
> >> 
> >> After a power failure left our jewel cluster crippled, I have hit a 
> >> sticking point in attempted recovery.
> >> 
> >> Out of 8 osd’s, we likely lost 5-6, trying to salvage what we can.
> >> 
> > 
> > That's probably too much. How do you mean lost? Is XFS crippled/corrupted? 
> > That shouldn't happen.
> > 
> >> In addition to rados pools, we were also using CephFS, and the 
> >> cephfs.metadata and cephfs.data pools likely lost plenty of PG’s.
> >> 
> > 
> > What is the status of all PGs? What does 'ceph -s' show?
> > 
> > Are all PGs active? Since that's something which needs to be done first.
> > 
> >> The mds has reported this ever since returning from the power loss:
> >>> # ceph mds stat
> >>> e3627: 1/1/1 up {0=core=up:replay}
> >> 
> >> 
> >> When looking at the slow request on the osd, it shows this task which I 
> >> can’t quite figure out. Any help appreciated.
> >> 
> > 
> > Are all clients (including MDS) and OSDs running the same version?
> > 
> > Wido
> > 
> >>> # ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight
> >>> {
> >>>"ops": [
> >>>{
> >>>"description": "osd_op(mds.0.3625:8 6.c5265ab3 (undecoded) 
> >>> ack+retry+read+known_if_redirected+full_force e3668)",
> >>>"initiated_at": "2016-08-31 10:37:18.833644",
> >>>"age": 22212.235361,
> >>>"duration": 22212.235379,
> >>>"type_data": [
> >>>"no flag points reached",
> >>>