Re: [ceph-users] Failed to repair pg

2019-03-07 Thread David Zafman


On 3/7/19 9:32 AM, Herbert Alexander Faleiros wrote:

On Thu, Mar 07, 2019 at 01:37:55PM -0300, Herbert Alexander Faleiros wrote:
Should I do something like this? (below, after stop osd.36)

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36/ --journal-path 
/dev/sdc1 rbd_data.dfd5e2235befd0.0001c299 remove-clone-metadata 326022

I'm not sure about rbd_data.$RBD and $CLONEID (taken from rados
list-inconsistent-obj, also below).



See what results you get from this command.

# rados list-inconsistent-snapset 2.2bb --format=json-pretty

You might see the following, which means there is nothing interesting.  If you 
don't get JSON back, re-run the scrub.


{
    "epoch": ##,
    "inconsistents": []
}

I don't think you need to do the remove-clone-metadata, because you got 
"unexpected clone"; I think you'd just get "Clone 326022 not present".


I think you need to remove the clone object from osd.12 and osd.80.  For 
example:


# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12/ 
--journal-path /dev/sdXX --op list rbd_data.dfd5e2235befd0.0001c299


["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.0001c299","key":"","snapid":-2,"hash":,"max":0,"pool":2,"namespace":"","max":0}]
["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.0001c299","key":"","snapid":326022,"hash":#,"max":0,"pool":2,"namespace":"","max":0}]

Use the json for snapid 326022 to remove it.

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12/ 
--journal-path /dev/sdXX 
'["2.2bb",{"oid":"rbd_data.dfd5e2235befd0.0001c299","key":"","snapid":326022,"hash":#,"max":0,"pool":2,"namespace":"","max":0}]' 
remove
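
A minimal sketch of the follow-up, assuming systemd-managed OSDs and that the 
same stray clone was removed from both osd.12 and osd.80 (the pgid is the one 
from this thread):

# restart the OSDs that were stopped for the object removal
systemctl start ceph-osd@12
systemctl start ceph-osd@80
# re-scrub so the PG state is re-evaluated; run repair only if errors remain
ceph pg deep-scrub 2.2bb
ceph pg repair 2.2bb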



David



Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-30 Thread David Zafman


Strange, I can't reproduce this with v13.2.4.  I tried the following 
scenarios:


pg acting 1, 0, 2 -> up 1, 0, 4 (osd.2 marked out).  The df on osd.2 
shows 0 space, but only osd.4 (the backfill target) checks full space.


pg acting 1, 0, 2 -> up 4, 3, 5 (osd.1, 0, 2 all marked out).  The df for 
1, 0, 2 shows 0 space, but osd.4, 3, 5 (the backfill targets) check full space.


FYI, in a later release, even when a backfill target is below 
backfillfull_ratio, backfill_toofull occurs if there isn't enough room 
for the pg to fit.



The question in your case is whether any of OSDs 999, 1900, or 145 was above 
the 90% backfillfull_ratio usage.
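
A quick way to check both, as a sketch (the egrep pattern is only illustrative):

# the ratios recorded in the osdmap
ceph osd dump | grep ratio
# utilization of the three backfill targets (%USE column)
ceph osd df | egrep '^ *(999|1900|145) '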


David

On 1/27/19 11:34 PM, Wido den Hollander wrote:


On 1/25/19 8:33 AM, Gregory Farnum wrote:

This doesn’t look familiar to me. Is the cluster still doing recovery so
we can at least expect them to make progress when the “out” OSDs get
removed from the set?

The recovery has already finished. It resolves itself, but in the
meantime I saw many PGs in the backfill_toofull state for a long time.

This is new since Mimic.

Wido


On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander <w...@42on.com> wrote:

 Hi,

 I've got a couple of PGs which are stuck in backfill_toofull, but none
 of them are actually full.

   "up": [
     999,
     1900,
     145
   ],
   "acting": [
     701,
     1146,
     1880
   ],
   "backfill_targets": [
     "145",
     "999",
     "1900"
   ],
   "acting_recovery_backfill": [
     "145",
     "701",
     "999",
     "1146",
     "1880",
     "1900"
   ],

 I checked all these OSDs, but they are all <75% utilization.

 full_ratio 0.95
 backfillfull_ratio 0.9
 nearfull_ratio 0.9

 So I started checking all the PGs and I've noticed that each of these
 PGs has one OSD in the 'acting_recovery_backfill' which is marked as
 out.

 In this case osd.1880 is marked as out and thus its capacity is shown
 as zero.

 [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
 1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
 [ceph@ceph-mgr ~]$

 This is on a Mimic 13.2.4 cluster. Is this expected or is this an unknown
 side-effect of one of the OSDs being marked as out?

 Thanks,

 Wido


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-03-14 Thread David Zafman


The fix for tracker 20089 undid the changes you're seeing in the 15368 
pull request.  The attr name mismatch of 'hinfo_key' means that the key is 
missing, because every erasure-coded object should have a key called 
"hinfo_key".


You should try to determine why your extended attributes are getting 
corrupted.  All the errors are on shard 0.  My testing shows that repair 
will fix this scenario.
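
A minimal sketch for confirming what scrub recorded and kicking off the repair, 
assuming jq is available (the pgid is taken from the log excerpt below):

rados list-inconsistent-obj 70.3db --format=json-pretty | jq '.inconsistents[].union_shard_errors'
ceph pg repair 70.3db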


David


On 3/13/18 3:48 PM, Graham Allan wrote:
Updated cluster now to 12.2.4 and the cycle of 
inconsistent->repair->unfound seems to continue, though possibly 
slightly differently. A pg does pass through an "active+clean" phase 
after repair, which might be new, but more likely I never observed it 
at the right time before.


I see messages like this in the logs now "attr name mismatch 
'hinfo_key'" - perhaps this might cast more light on the cause:


2018-03-02 18:55:11.583850 osd.386 osd.386 10.31.0.72:6817/4057280 
401 : cluster [ERR] 70.3dbs0 : soid 
70:dbc6ed68:::default.325674.85_bellplants_images%2f1055211.jpg:head 
attr name mismatch 'hinfo_key'
2018-03-02 19:00:18.031929 osd.386 osd.386 10.31.0.72:6817/4057280 
428 : cluster [ERR] 70.3dbs0 : soid 
70:dbc97561:::default.325674.85_bellplants_images%2f1017818.jpg:head 
attr name mismatch 'hinfo_key'
2018-03-02 19:04:50.058477 osd.386 osd.386 10.31.0.72:6817/4057280 
452 : cluster [ERR] 70.3dbs0 : soid 
70:dbcbcb34:::default.325674.85_bellplants_images%2f1049756.jpg:head 
attr name mismatch 'hinfo_key'
2018-03-02 19:13:05.689136 osd.386 osd.386 10.31.0.72:6817/4057280 
494 : cluster [ERR] 70.3dbs0 : soid 
70:dbcfc7c9:::default.325674.85_bellplants_images%2f1021177.jpg:head 
attr name mismatch 'hinfo_key'
2018-03-02 19:13:30.883100 osd.386 osd.386 10.31.0.72:6817/4057280 
495 : cluster [ERR] 70.3dbs0 repair 0 missing, 161 inconsistent objects
2018-03-02 19:13:30.888259 osd.386 osd.386 10.31.0.72:6817/4057280 
496 : cluster [ERR] 70.3db repair 161 errors, 161 fixed


The only similar-sounding issue I could find is

http://tracker.ceph.com/issues/20089

When I look at src/osd/PGBackend.cc be_compare_scrubmaps in luminous, 
I don't see the changes in the commit here:


https://github.com/ceph/ceph/pull/15368/files

of course a lot of other things have changed, but is it possible this 
fix never made it into luminous?


Graham

On 02/17/2018 12:48 PM, David Zafman wrote:


The commits below came after v12.2.2 and may impact this issue. When a 
pg is active+clean+inconsistent, it means that scrub has detected 
issues with 1 or more replicas of 1 or more objects. An unfound 
object is a potentially temporary state in which the current set of 
available OSDs doesn't allow an object to be 
recovered/backfilled/repaired.  When the primary OSD restarts, any 
unfound objects (an in-memory structure) are reset so that the new 
set of peered OSDs can determine again what objects are unfound.


I'm not clear in this scenario whether recovery failed to start, 
hung due to a bug, or stopped (as 
designed) because of the unfound object.  The new recovery_unfound 
and backfill_unfound states indicate that recovery has stopped due 
to unfound objects.



commit 64047e1bac2e775a06423a03cfab69b88462538c
Author: David Zafman <dzaf...@redhat.com>
Date:   Wed Jan 10 13:30:41 2018 -0800

 osd: Don't start recovery for missing until active pg state set

 I was seeing recovery hang when it is started before 
_activate_committed()
 The state machine passes into "Active" but this transitions to 
activating

 pg state and after commmitted into "active" pg state.

 Signed-off-by: David Zafman <dzaf...@redhat.com>

commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Author: David Zafman <dzaf...@redhat.com>
Date:   Mon Oct 9 08:19:21 2017 -0700

 osd, mon: Add new pg states recovery_unfound and backfill_unfound

 Signed-off-by: David Zafman <dzaf...@redhat.com>



On 2/16/18 1:40 PM, Gregory Farnum wrote:

On Fri, Feb 16, 2018 at 12:17 PM Graham Allan <g...@umn.edu> wrote:


On 02/16/2018 12:31 PM, Graham Allan wrote:

If I set debug rgw=1 and debug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying to communicate with 
osds for

pool 96, which is .rgw.control


10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20
con 0x7f117afd8390

Prior to that, probably more relevant, this was the only communication
logged with the primary osd of the pg:

10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unkn

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-17 Thread David Zafman


The commits below came after v12.2.2 and may impact this issue. When a 
pg is active+clean+inconsistent, it means that scrub has detected issues 
with 1 or more replicas of 1 or more objects.  An unfound object is a 
potentially temporary state in which the current set of available OSDs 
doesn't allow an object to be recovered/backfilled/repaired.  When the 
primary OSD restarts, any unfound objects (an in-memory structure) are 
reset so that the new set of peered OSDs can determine again what 
objects are unfound.


I'm not clear in this scenario whether recovery failed to start, 
hung due to a bug, or stopped (as designed) 
because of the unfound object.  The new recovery_unfound and 
backfill_unfound states indicate that recovery has stopped due to 
unfound objects.
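
For inspecting unfound objects on a running cluster, a short sketch (the pgid 
is just the one from the log excerpts below); mark_unfound_lost discards data 
and is a last resort only:

ceph health detail | grep unfound
ceph pg 70.438 list_unfound        # which objects, and which OSDs might still have them
ceph pg 70.438 query               # check the "might_have_unfound" section
# last resort, only after every candidate OSD has been brought back:
# ceph pg 70.438 mark_unfound_lost revert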



commit 64047e1bac2e775a06423a03cfab69b88462538c
Author: David Zafman <dzaf...@redhat.com>
Date:   Wed Jan 10 13:30:41 2018 -0800

    osd: Don't start recovery for missing until active pg state set

    I was seeing recovery hang when it is started before 
_activate_committed()
    The state machine passes into "Active" but this transitions to 
activating

    pg state and after commmitted into "active" pg state.

    Signed-off-by: David Zafman <dzaf...@redhat.com>

commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Author: David Zafman <dzaf...@redhat.com>
Date:   Mon Oct 9 08:19:21 2017 -0700

    osd, mon: Add new pg states recovery_unfound and backfill_unfound

    Signed-off-by: David Zafman <dzaf...@redhat.com>



On 2/16/18 1:40 PM, Gregory Farnum wrote:

On Fri, Feb 16, 2018 at 12:17 PM Graham Allan <g...@umn.edu> wrote:


On 02/16/2018 12:31 PM, Graham Allan wrote:

If I set debug rgw=1 and debug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying to communicate with osds for
pool 96, which is .rgw.control


10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20
con 0x7f117afd8390

Prior to that, probably more relevant, this was the only communication
logged with the primary osd of the pg:


10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 
osd_backoff(70.438s0 block id 1


[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)

e507695) v1  209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_backoff(70.438s0 ack-block id 1


[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)

e507695) v1 -- 0x7fab48065420 con 0

so I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense. Not sure that it helps me
understand how to fix the inconsistencies

If I restart the primary osd for the pg, that makes it forget its state
and return to active+clean+inconsistent. I can then download the
previously-unfound objects again, as well as run "radosgw-admin object
stat".

So the interesting bit is probably figuring out why it decides these
objects are unfound, when clearly they aren't.

What would be the best place to enable additional logging to understand
this - perhaps the primary osd?


David, this sounds like one of the bugs where an OSD can mark objects as
inconsistent locally but then doesn't actually trigger recovery on them. Or
it doesn't like any copy but doesn't persist that.
Do any known issues around that apply to 12.2.2?
-Greg



Thanks for all your help,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu





Re: [ceph-users] ghost degraded objects

2018-01-22 Thread David Zafman


Yes, the pending backport for what we have so far is in 
https://github.com/ceph/ceph/pull/20055


With these changes, a backfill caused by marking an osd out has the 
results shown below:



    health: HEALTH_WARN
    115/600 objects misplaced (19.167%)

...
  data:
    pools:   1 pools, 1 pgs
    objects: 200 objects, 310 kB
    usage:   173 GB used, 126 GB / 299 GB avail
    pgs: 115/600 objects misplaced (19.167%)
 1 active+remapped+backfilling

David


On 1/19/18 5:14 AM, Sage Weil wrote:

On Fri, 19 Jan 2018, Ugis wrote:

Running Luminous 12.2.2, noticed strange behavior lately.
When, for example, setting "ceph osd out X", closer to the end of
rebalancing "degraded" objects still show up, but in the "pgs:" section of ceph -s
no degraded pgs are still recovering, just remapped, and no degraded
pgs can be found in "ceph pg dump"

   health: HEALTH_WARN
 355767/30286841 objects misplaced (1.175%)
 Degraded data redundancy: 28/30286841 objects degraded
(0.000%), 96 pgs unclean

   services:
 ...
 osd: 38 osds: 38 up, 37 in; 96 remapped pgs

   data:
 pools:   19 pools, 4176 pgs
 objects: 9859k objects, 39358 GB
 usage:   114 TB used, 120 TB / 234 TB avail
 pgs: 28/30286841 objects degraded (0.000%)
  355767/30286841 objects misplaced (1.175%)
  4080 active+clean
  81   active+remapped+backfilling
  15   active+remapped+backfill_wait


Where those 28 degraded objects come from?

There aren't actually degraded objects... in this case it's just
misreporting that there are.

This is a known issue in luminous.  Shortly after release we noticed the
problem and David has been working on several changes to the stats
calculation to improve the reporting, but those changes have not been
backported (and aren't quite complete, either--getting a truly accurate
number there is nontrivial in some cases it turns out).


In such cases the degraded objects usually also disappear when backfilling
is done, but normally degraded objects should be fixed before remapped
ones by priority.

Yes.

It's unfortunately a scary warning (there shouldn't be degraded
objects... and generally speaking aren't) that understandably alarms
users.  We hope to have this sorted out soon!

sage


Re: [ceph-users] FAILED assert(p.same_interval_since) and unusable cluster

2017-11-01 Thread David Zafman


Jon,

    If you are able, please test my tentative fix for this issue, which 
is in https://github.com/ceph/ceph/pull/18673



Thanks

David


On 10/30/17 1:13 AM, Jon Light wrote:

Hello,

I have three OSDs that are crashing on start with a FAILED
assert(p.same_interval_since) error. I ran across a thread from a few days
ago about the same issue and a ticket was created here:
http://tracker.ceph.com/issues/21833.

A very overloaded node in my cluster OOM'd many times which eventually led
to the problematic PGs and then the failed assert.

I currently have 49 pgs inactive, 33 pgs down, 15 pgs incomplete as well as
0.028% of objects unfound. Presumably due to this, I can't add any data to
the FS or read some data. Just about any IO ends up in a good bit of stuck
requests.

Hopefully a fix can come from the issue, but can anyone give me some
suggestions or guidance to get the cluster in a working state in the
meantime?

Thanks





Re: [ceph-users] Osd FAILED assert(p.same_interval_since)

2017-10-16 Thread David Zafman


I don't see same_interval_since being cleared by split. 
PG::split_into() copies the history from the parent PG to the child.  The 
only code in Luminous that I see clearing it is in 
ceph_objectstore_tool.cc.


David


On 10/16/17 3:59 PM, Gregory Farnum wrote:

On Mon, Oct 16, 2017 at 3:49 PM Dejan Lesjak  wrote:


On 17. okt. 2017, at 00:23, Gregory Farnum  wrote:

On Mon, Oct 16, 2017 at 8:24 AM Dejan Lesjak 

wrote:

On 10/16/2017 02:02 PM, Dejan Lesjak wrote:

Hi,

During rather high load and rebalancing, a couple of our OSDs crashed
and they fail to start. This is from the log:

 -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123

load_pgs

opened 370 pgs
 -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
build_past_intervals_parallel over 439159-439159
  0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1


/var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:

In function 'void OSD::build_past_intervals_parallel()' thread
7f5e4c3bae80 time 2017-10-16 13:27:50.260062


/var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:

4177: FAILED assert(p.same_interval_since)

  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)

luminous

(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x55e4caa18592]
  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
  4: (OSD::init()+0x2227) [0x55e4ca467327]
  5: (main()+0x2d5a) [0x55e4ca379b1a]
  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
  7: (_start()+0x2a) [0x55e4ca4039aa]
  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Does anybody know how to fix or further debug this?

Bumped logging to 10 and posted log to https://pastebin.com/raw/StTeYWRt
 From "10.1fces2 needs 439159-0" it seems osd (osd.1) gets stuck at pg
10.1fce. Yet pg map doesn't show osd.1 for this pg:

# ceph pg map 10.1fce
osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
[110,213,132,182]

Hmm, this is odd. What caused your rebalancing exactly? Can you turn on
the OSD with debugging set to 20, and then upload the log file using
ceph-post-file?

The specific assert you're hitting here is supposed to cope with PGs
that have been imported (via the ceph-objectstore-tool). But obviously
something has gone wrong here.

It started when we bumped the number of PGs for a pool (from 2048 to 8192).
I’ve sent the log with ID 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba

It actually seems similar to http://tracker.ceph.com/issues/21142 in
that the pg found in the log seems empty when checked with ceph-objectstore-tool,
and removing it allows the osd to start. At least on one osd; I’ve not
tried that yet on all of the failed ones.


Ah. I bet we are default-constructing the "child" PGs from split with that
value set to zero, so it's incorrectly being flagged for later use. David,
does that make sense to you? Do you think it's reasonable to fix it by just
checking for other default-initialized values as part of that branch check?
(I note that this code got removed once Luminous branched, so hopefully
there's a simple fix we can apply!)

Dejan, did you make sure the OSD you tried that on has re-created the
removed PG and populated it with data? If so I think you ought to be fine
removing any empty PGs which are causing this assert.
-Greg
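
For reference, a sketch of the inspect/export/remove sequence being discussed, 
assuming a default data path and the pg shard from the log above; filestore 
OSDs also need --journal-path, and newer ceph-objectstore-tool versions may 
want --force or the combined export-remove operation for the removal step:

# with the affected OSD stopped
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op list-pgs | grep 10.1fce
# keep a backup of the PG before touching it
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 10.1fces2 --op export --file /root/10.1fces2.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 10.1fces2 --op remove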





Re: [ceph-users] objects degraded higher than 100%

2017-10-13 Thread David Zafman


I improved the code to compute degraded objects during 
backfill/recovery.  During my testing it wouldn't result in a percentage 
above 100%.  I'll have to look at the code and verify that some 
subsequent changes didn't break things.


David


On 10/13/17 9:55 AM, Florian Haas wrote:

Okay, in that case I've no idea. What was the timeline for the recovery
versus the rados bench and cleanup versus the degraded object counts,
then?

1. Jewel deployment with filestore.
2. Upgrade to Luminous (including mgr deployment and "ceph osd
require-osd-release luminous"), still on filestore.
3. rados bench with subsequent cleanup.
4. All OSDs up, all  PGs active+clean.
5. Stop one OSD. Remove from CRUSH, auth list, OSD map.
6. Reinitialize OSD with bluestore.
7. Start OSD, commencing backfill.
8. Degraded objects above 100%.

Please let me know if that information is useful. Thank you!


Hmm, that does leave me a little perplexed.

Yeah exactly, me too. :)


David, do we maybe do something with degraded counts based on the number of
objects identified in pg logs? Or some other heuristic for number of objects
that might be stale? That's the only way I can think of to get these weird
returning sets.

One thing that just crossed my mind: would it make a difference
whether or not the OSD goes out in the time window between it
going down and being deleted from the crushmap/osdmap? I think it
shouldn't (whether marked out or just non-existent, it's not
eligible for holding any data either way), but I'm not really sure
about the mechanics of the internals here.

Cheers,
Florian




Re: [ceph-users] inconsistent pg will not repair

2017-09-26 Thread David Zafman


The following is based on the discussion in: 
http://tracker.ceph.com/issues/21388


--

There is a particular scenario which, if identified, can be repaired 
manually.  In this case the automatic repair rejects all copies because 
none match the selected_object_info, thus setting data_digest_mismatch_oi 
on all shards.


Doing the following should produce list-inconsistent-obj information:

$ ceph pg deep-scrub 1.0
(Wait for scrub to finish)
$ rados list-inconsistent-obj 1.0 --format=json-pretty

Requirements:

1. data_digest_mismatch_oi is set on all shards, making it unrepairable
2. union_shard_errors has only data_digest_mismatch_oi listed, no other
   issues involved
3. Object "errors" is empty { "inconsistent": [ { ..."errors": []}
   ] } which means the data_digest value is the same on all shards
   (0x2d4a11c2 in the example below)
4. No down OSDs which might have different/correct data

To fix it, use rados get/put followed by a deep-scrub to clear the 
"inconsistent" pg state, as in the steps and the sketch below.  Use the -b option with a value 
smaller than the file size so that the read doesn't compare the digest and return EIO.


1. rados -p pool -b 10240 get mytestobject tempfile
2. rados -p pool put mytestobject tempfile
3. ceph pg deep-scrub 1.0
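
Putting the steps together as one hedged sketch (the pool name, object name, 
and pgid are the placeholders used above):

rados -p pool -b 10240 get mytestobject tempfile   # a partial-block read avoids the digest check and the EIO
rados -p pool put mytestobject tempfile            # the full rewrite records a fresh, matching digest
ceph pg deep-scrub 1.0
# after the deep-scrub completes this should come back with an empty "inconsistents" array
rados list-inconsistent-obj 1.0 --format=json-pretty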


Here is an example list-inconsistent-obj output of what this scenario 
looks like:


{ "inconsistents": [ { "shards": [ { "data_digest": "0x2d4a11c2", 
"omap_digest": "0xf5fba2c6", "size": 143456, "errors": [ 
"data_digest_mismatch_oi" ], "osd": 0, "primary": true }, { 
"data_digest": "0x2d4a11c2", "omap_digest": "0xf5fba2c6", "size": 
143456, "errors": [ "data_digest_mismatch_oi" ], "osd": 1, "primary": 
false }, { "data_digest": "0x2d4a11c2", "omap_digest": "0xf5fba2c6", 
"size": 143456, "errors": [ "data_digest_mismatch_oi" ], "osd": 2, 
"primary": false } ], "selected_object_info": "3:ce3f1d6a::: 
mytestobject:head(47'54 osd.0.0:53 dirty|omap|data_digest|omap_digest s 
143456 uv 3 dd 2ddbf8f5 od f5fba2c6 alloc_hint [0 0 0])", 
"union_shard_errors": [ "data_digest_mismatch_oi" ], "errors": [ ], 
"object": { "version": 3, "snap": "head", "locator": "", "nspace": "", 
"name": "mytestobject" } } ], "epoch": 103443 }



David

On 9/26/17 10:55 AM, Gregory Farnum wrote:

[ Re-send due to HTML email part]

IIRC, this is because the object info and the actual object disagree
about what the checksum should be. I don't know the best way to fix it
off-hand but it's been discussed on the list (try searching for email
threads involving David Zafman).
-Greg

On Tue, Sep 26, 2017 at 7:03 AM, Wyllys Ingersoll
<wyllys.ingers...@keepertech.com> wrote:

I have an inconsistent PG that I cannot seem to get to repair cleanly.
I can find the 3 objects in question and they all have the same size
and md5sum, but yet whenever I repair it, it is reported as an error
"failed to pick suitable auth object".

Any suggestions for fixing or workaround this issue to resolve the
inconsistency?

Ceph 10.2.9
Ubuntu 16.04.2


2017-09-26 09:54:03.123938 7fd31048e700 -1 log_channel(cluster) log
[ERR] : 1.5b8 shard 7: soid 1:1daab06b:::14d6662.:head
data_digest 0x923deb74 != data_digest 0x23f10be8 from auth oi
1:1daab06b:::14d6662.:head(204442'221517
client.5654254.1:2371279 dirty|data_digest|omap_digest s 1421644 uv
203993 dd 23f10be8 od  alloc_hint [0 0])
2017-09-26 09:54:03.123944 7fd31048e700  0 log_channel(cluster) do_log
log to syslog
2017-09-26 09:54:03.123999 7fd31048e700 -1 log_channel(cluster) log
[ERR] : 1.5b8 shard 26: soid 1:1daab06b:::14d6662.:head
data_digest 0x923deb74 != data_digest 0x23f10be8 from auth oi
1:1daab06b:::14d6662.:head(204442'221517
client.5654254.1:2371279 dirty|data_digest|omap_digest s 1421644 uv
203993 dd 23f10be8 od  alloc_hint [0 0])
2017-09-26 09:54:03.124005 7fd31048e700  0 log_channel(cluster) do_log
log to syslog
2017-09-26 09:54:03.124013 7fd31048e700 -1 log_channel(cluster) log
[ERR] : 1.5b8 shard 44: soid 1:1daab06b:::14d6662.:head
data_digest 0x923deb74 != data_digest 0x23f10be8 from auth oi
1:1daab06b:::14d6662.:head(204442'221517
client.5654254.1:2371279 dirty|data_digest|omap_digest s 1421644 uv
203993 dd 23f10be8 od  alloc_hint [0 0])
2017-09-26 09:54:03.124015 7fd31048e700  0 log_channel(cluster) do_log
log to syslog
2017-09-26 09:54:03.124022 7fd31048e700 -1 log_channel(cluster) log
[ERR] : 1.5b8 soid 1:1daab06b:::14d6662.:head: failed to
pick suitable auth object

Re: [ceph-users] Significant uptick in inconsistent pgs in Jewel 10.2.9

2017-09-08 Thread David Zafman


Robin,

Would you generate the values and keys for the various versions of at 
least one of the objects?   .dir.default.292886573.13181.12 is a good 
example because there are 3 variations for the same object.


If there isn't much activity to .dir.default.64449186.344176, you could 
do one osd at a time.  Otherwise, stop all 3 OSDs (1322, 990, 655) and execute 
these for all 3.  I suspect you'll need to pipe to "od -cx" to get 
printable output.


I created a simple object with ascii omap.

$ ceph-objectstore-tool --data-path ... --pgid 5.3d40 
.dir.default.64449186.344176 get-omaphdr

obj_header

$ for i in $(ceph-objectstore-tool --data-path ... --pgid 5.3d40 \
      .dir.default.64449186.344176 list-omap)
  do
      echo -n "${i}: "
      # dump the value for each key of the same object; pipe to "od -cx" if it isn't printable
      ceph-objectstore-tool --data-path ... --pgid 5.3d40 \
          .dir.default.64449186.344176 get-omap $i
  done

key1: val1
key2: val2
key3: val3

David


On 9/8/17 12:18 PM, David Zafman wrote:


Robin,

The only two changesets I can spot in Jewel that I think might be 
related are these:

1.
http://tracker.ceph.com/issues/20089
https://github.com/ceph/ceph/pull/15416


This should improve the repair functionality.


2.
http://tracker.ceph.com/issues/19404
https://github.com/ceph/ceph/pull/14204

This pull request fixes an issue that corrupted omaps.  It also finds 
and repairs them.  However, the repair process might resurrect deleted 
omaps which would show up as an omap digest error.


This could temporarily cause additional inconsistent PGs.  So if this 
has NOT been occurring longer than your deep-scrub interval since 
upgrading, I'd repair the pgs and monitor going forward to make sure 
the problem doesn't recur.


---

You have good examples of repair scenarios:


.dir.default.292886573.13181.12   only has an omap_digest_mismatch and 
no shard errors.  The automatic repair won't be sure which copy is 
good.


In this case we can see that osd.1327 doesn't match the other two.  To 
assist the repair process in choosing the right one, remove the copy on 
osd.1327.


Stop osd 1327 and use "ceph-objectstore-tool --data-path .1327 
.dir.default.292886573.13181.12 remove"



.dir.default.64449186.344176 has selected_object_info with "od 
337cf025", so the shards have "omap_digest_mismatch_oi" except for osd.990.


The pg repair code will use osd.990 to fix the other 2 copies without 
further handling.



David



On 9/8/17 11:16 AM, Robin H. Johnson wrote:

On Thu, Sep 07, 2017 at 08:24:04PM +, Robin H. Johnson wrote:

pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655]
pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91]

Here is the output of 'rados list-inconsistent-obj' for the PGs:

$ sudo rados list-inconsistent-obj 5.f1c0 |json_pp -json_opt 
canonical,pretty

{
    "epoch" : 1221254,
    "inconsistents" : [
   {
  "errors" : [
 "omap_digest_mismatch"
  ],
  "object" : {
 "locator" : "",
 "name" : ".dir.default.292886573.13181.12",
 "nspace" : "",
 "snap" : "head",
 "version" : 483490
  },
  "selected_object_info" : 
"5:038f1cff:::.dir.default.292886573.13181.12:head(1221843'483490 
client.417313345.0:19515832 dirty|omap|data_digest s 0 uv 483490 dd 
 alloc_hint [0 0])",

  "shards" : [
 {
    "data_digest" : "0x",
    "errors" : [],
    "omap_digest" : "0x928b0c0b",
    "osd" : 91,
    "size" : 0
 },
 {
    "data_digest" : "0x",
    "errors" : [],
    "omap_digest" : "0x928b0c0b",
    "osd" : 631,
    "size" : 0
 },
 {
    "data_digest" : "0x",
    "errors" : [],
    "omap_digest" : "0x6556c868",
    "osd" : 1327,
    "size" : 0
 }
  ],
  "union_shard_errors" : []
   }
    ]
}
$ sudo rados list-inconsistent-obj 5.3d40  |json_pp -json_opt 
canonical,pretty

{
    "epoch" : 1210895,
    "inconsistents" : [
   {
  "errors" : [
 "omap_digest_mismatch"
  ],
  "object" : {
 "locator" : "",
 "name" : ".dir.default.64449186.344176",
 "nspace" : "",
 "snap" : "head",
 "version" : 1177199
 

Re: [ceph-users] Significant uptick in inconsistent pgs in Jewel 10.2.9

2017-09-08 Thread David Zafman


Robin,

The only two changesets I can spot in Jewel that I think might be 
related are these:

1.
http://tracker.ceph.com/issues/20089
https://github.com/ceph/ceph/pull/15416


This should improve the repair functionality.


2.
http://tracker.ceph.com/issues/19404
https://github.com/ceph/ceph/pull/14204

This pull request fixes an issue that corrupted omaps.  It also finds 
and repairs them.  However, the repair process might resurrect deleted 
omaps which would show up as an omap digest error.


This could temporarily cause additional inconsistent PGs.  So if this 
has NOT been occurring longer than your deep-scrub interval since 
upgrading, I'd repair the pgs and monitor going forward to make sure the 
problem doesn't recur.


---

You have good examples of repair scenarios:


.dir.default.292886573.13181.12   only has an omap_digest_mismatch and no 
shard errors.  The automatic repair won't be sure which copy is good.


In this case we can see that osd.1327 doesn't match the other two.  To 
assist the repair process in choosing the right one, remove the copy on 
osd.1327.


Stop osd 1327 and use "ceph-objectstore-tool --data-path .1327 
.dir.default.292886573.13181.12 remove"



.dir.default.64449186.344176 has selected_object_info with "od 337cf025", 
so the shards have "omap_digest_mismatch_oi" except for osd.990.


The pg repair code will use osd.990 to fix the other 2 copies without 
further handling.
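
As a hedged sketch of the follow-up, after removing the copy on osd.1327 and 
starting it again (pgids as in this thread):

ceph pg repair 5.f1c0    # .dir.default.292886573.13181.12 gets repaired from the two matching copies
ceph pg repair 5.3d40    # .dir.default.64449186.344176 gets repaired from osd.990
# once the repair scrubs finish, both should report no inconsistent objects
rados list-inconsistent-obj 5.f1c0 --format=json-pretty
rados list-inconsistent-obj 5.3d40 --format=json-pretty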



David



On 9/8/17 11:16 AM, Robin H. Johnson wrote:

On Thu, Sep 07, 2017 at 08:24:04PM +, Robin H. Johnson wrote:

pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655]
pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91]

Here is the output of 'rados list-inconsistent-obj' for the PGs:

$ sudo rados list-inconsistent-obj 5.f1c0 |json_pp -json_opt canonical,pretty
{
"epoch" : 1221254,
"inconsistents" : [
   {
  "errors" : [
 "omap_digest_mismatch"
  ],
  "object" : {
 "locator" : "",
 "name" : ".dir.default.292886573.13181.12",
 "nspace" : "",
 "snap" : "head",
 "version" : 483490
  },
  "selected_object_info" : 
"5:038f1cff:::.dir.default.292886573.13181.12:head(1221843'483490 client.417313345.0:19515832 
dirty|omap|data_digest s 0 uv 483490 dd  alloc_hint [0 0])",
  "shards" : [
 {
"data_digest" : "0x",
"errors" : [],
"omap_digest" : "0x928b0c0b",
"osd" : 91,
"size" : 0
 },
 {
"data_digest" : "0x",
"errors" : [],
"omap_digest" : "0x928b0c0b",
"osd" : 631,
"size" : 0
 },
 {
"data_digest" : "0x",
"errors" : [],
"omap_digest" : "0x6556c868",
"osd" : 1327,
"size" : 0
 }
  ],
  "union_shard_errors" : []
   }
]
}
$ sudo rados list-inconsistent-obj 5.3d40  |json_pp -json_opt canonical,pretty
{
"epoch" : 1210895,
"inconsistents" : [
   {
  "errors" : [
 "omap_digest_mismatch"
  ],
  "object" : {
 "locator" : "",
 "name" : ".dir.default.64449186.344176",
 "nspace" : "",
 "snap" : "head",
 "version" : 1177199
  },
  "selected_object_info" : 
"5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 osd.1322.0:537914 
dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd  od 337cf025 alloc_hint [0 0])",
  "shards" : [
 {
"data_digest" : "0x",
"errors" : [
   "omap_digest_mismatch_oi"
],
"omap_digest" : "0x3242b04e",
"osd" : 655,
"size" : 0
 },
 {
"data_digest" : "0x",
"errors" : [],
"omap_digest" : "0x337cf025",
"osd" : 990,
"size" : 0
 },
 {
"data_digest" : "0x",
"errors" : [
   "omap_digest_mismatch_oi"
],
"omap_digest" : "0xc90d06a8",
"osd" : 1322,
"size" : 0
 }
  ],
  "union_shard_errors" : [
 "omap_digest_mismatch_oi"
  ]
   }
]
}







Re: [ceph-users] OSD's flapping on ordinary scrub with cluster being static (after upgrade to 12.1.1

2017-08-29 Thread David Zafman


Please file a bug in tracker: http://tracker.ceph.com/projects/ceph

When an OSD is marked down, is there a crash (e.g. assert, heartbeat 
timeout, or was it declared down by another daemon)?  Please include relevant log 
snippets.  If there is no obvious information, then bump the osd debug log levels.
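
A sketch of bumping the debug levels at runtime (osd.8 is just the OSD from 
the log excerpt below):

ceph tell osd.8 injectargs '--debug_osd 20 --debug_ms 1'
# or persistently in ceph.conf under [osd], followed by a daemon restart:
#   debug osd = 20
#   debug ms = 1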


Luminous LTS release happened today, so 12.2.0 is the best thing to run 
as of now.


See if any existing bugs like http://tracker.ceph.com/issues/21142 are 
related.


David


On 8/29/17 8:24 AM, Tomasz Kusmierz wrote:

So nobody has any clue on this one ???

Should I go with this one to dev mailing list ?


On 27 Aug 2017, at 01:49, Tomasz Kusmierz  wrote:

Hi,
for purposes of experimenting I’m running a home cluster that consists of 
single node and 4 OSD (weights in crush map are true to actual hdd size). I 
prefer to test all new stuff on home equipment before getting egg in the face 
at work :)
Anyway recently I’ve upgrade to Luminous, and replaced my ancient 8x 2TB drives 
with 2x 8TB drives (with hopes of getting more in near future). While doing 
that I’ve converted everything to bluestore. while still on 12.1.1

Everything was running smooth and performance was good (for ceph).

I’ve decided to upgrade recently to 12.1.2 and this is where everything started 
acting up. I’m aware that
- single node cluster is not a cluster
- in the end I might need more OSD (old joke right ?)
- I need to switch from spinning rust to SSD

Before the upgrade my “cluster” was only switching to WRN when I was pumping a 
lot of data into it, and it would just come up with “slow requests” stuff. Now, 
while completely static and not doing anything (no read, no write), OSDs are 
committing suicide due to timeouts; also, before they commit suicide I can't 
actually access data from the cluster, which makes me think the data is 
inaccessible while a scrub is running. Below I'll attach a log excerpt; please 
notice that it happens on deep scrub and normal scrub as well.

After I’ve discovered that I’ve tried to play around with sysctl.conf and with 
ceph.conf ( up to this point sysctl.conf was stock, and ceph.conf was just 
adjusted to allow greater OSD full capacity and disable cephx to speed it up)

also I’m running 3 pools on top of this cluster (all three have size = 2 
min_size = 2):
cephfs_data pg=256 (99.99% of data used in cluster)
cephfs_metadata pg=4 (0.01% of data used in cluster)
rbd pg=8 but this pool contains no data and I’m considering removing it since 
in my use case I’ve got nothing for it.

Please note that while this logs were produced cephFS was not even mounted :/



FYI hardware is old and trusted hp proliant DL180 G6 with 2 xeons @2.2GHz 
giving 16 cores and 32GB or ECC ram and LSI in HBA mode (2x 6GB SAS)



(
As a side issue could somebody explain to my why with bluestore that was supposed to 
cure cancer write performance still sucks ? I know that filestore did suffer from 
writing everything multiple times to same drive, and I did experience this first hand 
when after exhausting journals it was just dead slow, but now while within same host 
in my current configuration it keeps choking [flaps 70MB/s -> 10 MB/s -> 
70MB/s] and I never seen it even approach speed of single slowest drive. This server 
is not a speed daemon, I know, but when performing a simultaneous read / write for 
those drives I was getting around 760MB/s sequential R/W speed.
Right now I’m struggling to comprehend where the bottleneck is while performing 
operations within same host ?! network should not be an issue (correct me if 
I’m wrong here), dumping a singular blob into pool should produce a nice long 
sequence of object placed into drives …
I’m just puzzled why ceph will not exceed combined 40MB/s while still switching 
“cluster” into warning state due to “slow responses”
2017-08-24 20:49:34.457191 osd.8 osd.8 192.168.1.240:6814/3393 503 : cluster 
[WRN] slow request 63.878717 seconds old, received at 2017-08-24 
20:48:30.578398: osd_op(client.994130.1:13659 1.9700016d 
1:b68000e9:::10ffeef.0068:head [write 0~4194304 [1@-1]] snapc 1=[] 
ondisk+write+known_if_redirected e4306) currently waiting for active
2017-08-24 20:49:34.457195 osd.8 osd.8 192.168.1.240:6814/3393 504 : cluster 
[WRN] slow request 64.177858 seconds old, received at 2017-08-24 
20:48:30.279257: osd_op(client.994130.1:13568 1.b95e13a4 
1:25c87a9d:::10ffeef.000d:head [write 0~4194304 [1@-1]] snapc 1=[] 
ondisk+write+known_if_redirected e4306) currently waiting for active
2017-08-24 20:49:34.457198 osd.8 osd.8 192.168.1.240:6814/3393 505 : cluster 
[WRN] slow request 64.002653 seconds old, received at 2017-08-24 
20:48:30.454463: osd_op(client.994130.1:13626 1.b426420e 
1:7042642d:::10ffeef.0047:head [write 0~4194304 [1@-1]] snapc 1=[] 
ondisk+write+known_if_redirected e4306) currently waiting for active
2017-08-24 20:49:34.457200 osd.8 osd.8 192.168.1.240:6814/3393 506 : cluster 
[WRN] slow request 63.873519 seconds old, 

Re: [ceph-users] cephfs metadata damage and scrub error

2017-05-02 Thread David Zafman


James,

You have an omap corruption.  It is likely caused by a bug which 
has already been identified.  A fix for that problem is available but it 
is still pending backport for the next Jewel point release.  All 4 of 
your replicas have different "omap_digest" values.


Instead of the xattrs, the ceph-osdomap-tool --command 
dump-objects-with-keys output from OSDs 3, 10, 11, and 23 would be 
interesting to compare.
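
A hedged sketch of gathering that output, assuming filestore OSDs with the 
default omap path; run it with the OSD stopped and repeat on the hosts holding 
OSDs 10, 11 and 23:

ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-3/current/omap \
    --command dump-objects-with-keys > /tmp/osd-3-omap.txt
# then diff the four dumps to see which replica's omap differs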


***WARNING*** Please backup your data before doing any repair attempts.

If you can upgrade to Kraken v11.2.0, it will auto repair the omaps on 
ceph-osd start up.  It will likely still require a ceph pg repair to 
make the 4 replicas consistent with each other.  The final result may be 
the reappearance of removed MDS files in the directory.


If you can recover the data, you could remove the directory entirely and 
rebuild it.  The original bug was triggered during omap deletion 
typically in a large directory which corresponds to an individual unlink 
in cephfs.


If you can build a branch in github to get the newer ceph-osdomap-tool 
you could try to use it to repair the omaps.


David



On 5/2/17 5:05 AM, James Eckersall wrote:

Hi,

I'm having some issues with a ceph cluster.  It's an 8 node cluster running
Jewel ceph-10.2.7-0.el7.x86_64 on CentOS 7.
This cluster provides RBDs and a CephFS filesystem to a number of clients.

ceph health detail is showing the following errors:

pg 2.9 is active+clean+inconsistent, acting [3,10,11,23]
1 scrub errors
mds0: Metadata damage detected


The pg 2.9 is in the cephfs_metadata pool (id 2).

I've looked at the OSD logs for OSD 3, which is the primary for this PG,
but the only thing that appears relating to this PG is the following:

log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors

After initiating a ceph pg repair 2.9, I see the following in the primary
OSD log:

log_channel(cluster) log [ERR] : 2.9 repair 1 errors, 0 fixed
log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors


I found the below command in a previous ceph-users post.  Running this
returns the following:

# rados list-inconsistent-obj 2.9
{"epoch":23738,"inconsistents":[{"object":{"name":"1411194.","nspace":"","locator":"","snap":"head","version":14737091},"errors":["omap_digest_mismatch"],"union_shard_errors":[],"selected_object_info":"2:9758b358:::1411194.:head(33456'14737091
mds.0.214448:248532 dirty|omap|data_digest s 0 uv 14737091 dd
)","shards":[{"osd":3,"errors":[],"size":0,"omap_digest":"0x6748eef3","data_digest":"0x"},{"osd":10,"errors":[],"size":0,"omap_digest":"0xa791d5a4","data_digest":"0x"},{"osd":11,"errors":[],"size":0,"omap_digest":"0x53f46ab0","data_digest":"0x"},{"osd":23,"errors":[],"size":0,"omap_digest":"0x97b80594","data_digest":"0x"}]}]}


So from this, I think that the object in PG 2.9 with the problem is
1411194..

This is what I see on the filesystem on the 4 OSD's this PG resides on:

-rw-r--r--. 1 ceph ceph 0 Apr 27 12:31
/var/lib/ceph/osd/ceph-3/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 15 22:05
/var/lib/ceph/osd/ceph-10/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 15 22:07
/var/lib/ceph/osd/ceph-11/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
-rw-r--r--. 1 ceph ceph 0 Apr 16 03:58
/var/lib/ceph/osd/ceph-23/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2

The extended attrs are as follows, although I have no idea what any of them
mean.

# file:
var/lib/ceph/osd/ceph-11/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/1411194.__head_1ACD1AE9__2
user.ceph._=0sDwj5BAM1ABQxMDAwMDQxMTE5NC4wMDAwMDAwMP7/6RrNGgAAAgAGAxwCAP8AAP//ABUn4QAAu4IAAK4m4QAAu4IAAAICFQIAAOSZDAAAsEUDjUoIWUgWsQQCAhUVJ+EAABwAAACNSghZESm8BP///w==
user.ceph._@1=0s//8=
user.ceph._layout=0sAgIY//8A
user.ceph._parent=0sBQRPAQAAlBFBAAABAAAIAgIjjxFBAAABAAAPdHViZWFtYXRldXIubmV0qdgCAh0AAAB/EUEAAAEAAAkAAAB3cC1yb2NrZXREAAICGQAAABYNQQAAAQAABQAAAGNhY2hlUgACAh4QDUEAAAEAAAoAAAB3cC1jb250ZW50NAMCAhgNDUEAAAEAAAQAAABodG1sIAECAikAAADagTMAAAEAABUAAABuZ2lueC1waHA3LWNsdmdmLWRhdGGJAAICMwAAADkAAQ==
user.ceph._parent@1
=0sAAAfNDg4LTU3YjI2NTdmMmZhMTMtbWktcHJveWVjdG8tMXSQCgIcAQAIcHJvamVjdHPBAgcAAAIAAA==
user.ceph.snapset=0sAgIZAAABAA==
user.cephos.seq=0sAQEQgcAqFgA=
user.cephos.spill_out=0sMAA=
getfattr: Removing leading '/' from absolute path names

# file:

Re: [ceph-users] How safe is ceph pg repair these days?

2017-02-21 Thread David Zafman


Nick,

Yes, as you would expect, a read error would not be used as a source 
for repair, no matter which OSD(s) are getting read errors.



David

On 2/21/17 12:38 AM, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Gregory Farnum
Sent: 20 February 2017 22:13
To: Nick Fisk <n...@fisk.me.uk>; David Zafman <dzaf...@redhat.com>
Cc: ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] How safe is ceph pg repair these days?

On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk <n...@fisk.me.uk> wrote:

 From what I understand in Jewel+ Ceph has the concept of an
authorative shard, so in the case of a 3x replica pools, it will
notice that 2 replicas match and one doesn't and use one of the good
replicas. However, in a 2x pool your out of luck.

However, if someone could confirm my suspicions that would be good as

well.

Hmm, I went digging in and sadly this isn't quite right. The code has a lot of
internal plumbing to allow more smarts than were previously feasible and
the erasure-coded pools make use of them for noticing stuff like local
corruption. Replicated pools make an attempt but it's not as reliable as one
would like and it still doesn't involve any kind of voting mechanism.
A self-inconsistent replicated primary won't get chosen. A primary is self-
inconsistent when its digest doesn't match the data, which happens when:
1) the object hasn't been written since it was last scrubbed, or
2) the object was written in full, or
3) the object has only been appended to since the last time its digest was
recorded, or
4) something has gone terribly wrong in/under LevelDB and the omap entries
don't match what the digest says should be there.


Thanks for the correction Greg. So I'm guessing that the probability of
overwriting with an incorrect primary is reduced in later releases, but it
can still happen.

Quick question, and maybe this is a #5 on your list: what about
objects that are marked inconsistent on the primary due to a read error? I
would say 90% of my inconsistent PGs are caused by a read error and an
associated smartctl error.

"rados list-inconsistent-obj" shows that it knows that the primary had a
read error, so I assume a "pg repair" wouldn't try and read from the primary
again?


David knows more and correct if I'm missing something. He's also working on
interfaces for scrub that are more friendly in general and allow
administrators to make more fine-grained decisions about recovery in ways
that cooperate with RADOS.
-Greg


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Tracy Reed
Sent: 18 February 2017 03:06
To: Shinobu Kinjo <ski...@redhat.com>
Cc: ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] How safe is ceph pg repair these days?

Well, that's the question...is that safe? Because the link to the mailing list
post (possibly outdated) says that what you just suggested is definitely NOT
safe. Is the mailing list post wrong? Has the situation changed? Exactly what
does ceph repair do now? I suppose I could go dig into the code but I'm not
an expert and would hate to get it wrong and post possibly bogus info
to the list for other newbies to find and worry about and possibly
lose their data.

On Fri, Feb 17, 2017 at 06:08:39PM PST, Shinobu Kinjo spake thusly:

if ``ceph pg deep-scrub `` does not work then
   do
 ``ceph pg repair 


On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed
<tr...@ultraviolet.org>

wrote:

I have a 3 replica cluster. A couple times I have run into
inconsistent PGs. I googled it and ceph docs and various blogs
say run a repair first. But a couple people on IRC and a mailing
list thread from 2015 say that ceph blindly copies the primary
over the secondaries and calls it good.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html

I sure hope that isn't the case. If so it would seem highly
irresponsible to implement such a naive command called "repair".
I have recently learned how to properly analyze the OSD logs and
manually fix these things but not before having run repair on a
dozen inconsistent PGs. Now I'm worried about what sort of
corruption I may have introduced. Repairing things by hand is a
simple heuristic based on comparing the size or checksum (as
indicated by the logs) for each of the 3 copies and figuring out
which is correct. Presumably matching two out of three should win
and the odd object out should be deleted since having the exact
same kind of error on two different OSDs is highly improbable. I
don't understand why ceph repair wouldn't have done this all along.

What is the current best practice in the use of ceph repair?

Thanks!

--
Tracy Reed



-

Re: [ceph-users] Listing out the available namespace in the Ceph Cluster

2016-11-23 Thread David Zafman


Hi Janmejay,

Sorry, I just found your e-mail in my inbox.

There is no "list namespaces" operation; rather, you can list all objects in all 
namespaces using the --all option and filter the results.


I created 10 namespaces (ns1 - ns10) in addition to the default one.

rados -p testpool --all ls --format=json | jq '.[].namespace' | sort -u

""
"ns1"
"ns10"
"ns2"
"ns3"
"ns4"
"ns5"
"ns6"
"ns7"
"ns8"
"ns9"

David


On 11/15/16 12:28 AM, Janmejay Baral wrote:

Dear Mr. David,

I have been using Ceph for 1.5 years. Recently we have upgraded our
Ceph cluster from Hammer to Jewel. Now, for testing purposes, I need to
check all the available namespaces.  Can you please help me with the
query to find out the namespaces we have created?


Thanks & Regards,
Janmejay Baral

Actiance India Pvt. Ltd.
Sr. Software Engineer
+91-9739741384





Re: [ceph-users] OSDs refuse to start, latest osdmap missing

2016-04-15 Thread David Zafman


The ceph-objectstore-tool set-osdmap operation updates existing 
osdmaps.  If a map doesn't already exist, the --force option can be used 
to create it.  It appears safe in your case to use that option.
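
A hedged sketch of that, using the osd.43 paths and one of the epochs from the 
log below; the map is fetched from the monitors first, and the steps are 
repeated for each missing epoch:

ceph osd getmap 276293 -o /tmp/osdmap.276293
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-43 --journal-path /dev/ssd/journal.43 \
    --op set-osdmap --file /tmp/osdmap.276293 --force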


David

On 4/15/16 9:47 AM, Markus Blank-Burian wrote:

Hi,

  


we had a problem on our production cluster (running 9.2.1) which caused /proc,
/dev and /sys to be unmounted. During this time, we received the following error
on a large number of OSDs (for various osdmap epochs):

  


Apr 15 15:25:19 kaa-99 ceph-osd[4167]: 2016-04-15 15:25:19.457774 7f1c817fd700
0 filestore(/local/ceph/osd.43) write couldn't open
meta/-1/c188e154/osdmap.276293/0: (2) No such file or directory

  


After restarting the hosts, the OSDs now refuse to start with:

  


Apr 15 16:03:53 kaa-99 ceph-osd[4211]: -2> 2016-04-15 16:03:53.089842
7f8e9f840840 10 _load_class version success

Apr 15 16:03:53 kaa-99 ceph-osd[4211]: -1> 2016-04-15 16:03:53.089863
7f8e9f840840 20 osd.43 0 get_map 276424 - loading and decoding 0x7f8e9b841780

Apr 15 16:03:53 kaa-99 ceph-osd[4211]:  0> 2016-04-15 16:03:53.140754
7f8e9f840840 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)'
thread 7f8e9f840840 time 2016-04-15 16:03:53.139563

osd/OSD.h: 847: FAILED assert(ret)

  


Inserting the map with ceph-objectstore-tool --op set-osdmap does not work and
gives the following error:

  


osdmap (-1/c1882e94/osdmap.276507/0) does not exist.

2016-04-15 17:14:00.335751 7f4b4d75b840  1 journal close /dev/ssd/journal.43

  


How can I get the OSDs running again?

  


I also created an issue for this in the tracker:
http://tracker.ceph.com/issues/15520

There are some similar entries, but I could not find a solution without
recreating the OSD.

  

  


Markus

  


--

Markus Blank-Burian

AK Heuer, Institut für Physikalische Chemie, WWU Münster

Corrensstraße 28/30

Raum E005

Tel.: 0251 / 83 29178

E-Mail:   blankbur...@wwu.de

  

  







Re: [ceph-users] recorded data digest != on disk

2016-03-23 Thread David Zafman

On 3/23/16 7:45 AM, Gregory Farnum wrote:

On Tue, Mar 22, 2016 at 11:59 AM, Max A. Krasilnikov
 wrote:

Hello!

On Tue, Mar 22, 2016 at 11:40:39AM -0700, gfarnum wrote:


On Tue, Mar 22, 2016 at 1:19 AM, Max A. Krasilnikov  wrote:

 -1> 2016-03-21 17:36:09.048201 7f253f912700 -1 log_channel(cluster) log 
[ERR] : 5.ca recorded data digest 0xb284fef9 != on disk 0x43d61c5d on 
6134ccca/rbd_data.86280c78aaf7da.000e0bb5/17//5
  0> 2016-03-21 17:36:09.050672 7f253f912700 -1 osd/osd_types.cc: In 
function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f253f912700 
time 2016-03-21 17:36:09.048341
osd/osd_types.cc: 4103: FAILED assert(clone_size.count(clone))

This is the part causing crashes, not the data digest. Searching for
that error led me to http://tracker.ceph.com/issues/12954

So, can I expect a fix for this in future releases of hammer? As far as I can see, it is
merged now...

Hmm, it looks like it wasn't marked for backport and it might have
been a little complicated, but it's also the sort of thing I might
expect to see in an LTS release. David? :)
-Greg


Yes, it is merged to hammer already and should be part of the next 
hammer point release.


This was part of a large set of commits for multiple inter-related pull 
requests that needed to be backported together.


David


Re: [ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-08 Thread David Zafman


I expected it to return to osd.36.  Oh, if you set "noout" during this 
process then the pg won't move around when you down osd.36.  I expected 
osd.36 to go down and back up quickly.


Also, the pg 10.4f is the same situation, so try the same thing on osd.6.

David

On 3/8/16 1:05 PM, Ben Hines wrote:

After making that setting, the pg appeared to start peering but then it
actually changed the primary OSD to osd.100 - then went incomplete again.
Perhaps it did that because another OSD had more data? I presume i need to
set that value on each osd where the pg hops to.

-Ben

On Tue, Mar 8, 2016 at 10:39 AM, David Zafman <dzaf...@redhat.com> wrote:


Ben,

I haven't looked at everything in your message, but pg 12.7a1 has lost data
because of writes that went only to osd.73.  The way to recover this is to
force recovery to ignore this fact and go with whatever data you have on
the remaining OSDs.
I assume that having min_size 1, having multiple nodes failing and clients
continuing to write then permanently losing osd.73 caused this.

You should TEMPORARILY set osd_find_best_info_ignore_history_les config
variable to 1 on osd.36 and then mark it down (ceph osd down), so it will
rejoin, re-peer and mark the pg active+clean.  Don't forget to set
osd_find_best_info_ignore_history_les
back to 0.


Later you should fix your crush map.  See
http://docs.ceph.com/docs/master/rados/operations/crush-map/

The wrong placements make you vulnerable to a single host failure taking
out multiple copies of an object.

David


On 3/7/16 9:41 PM, Ben Hines wrote:

Howdy,

I was hoping someone could help me recover a couple pgs which are causing
problems in my cluster. If we aren't able to resolve this soon, we may have
to just destroy them and lose some data. Recovery has so far been
unsuccessful. Data loss would probably cause some here to reconsider Ceph
as something we'll stick with long term, so i'd love to recover it.

Ceph 9.2.1. I have 4 (well, 3 now) pgs which are incomplete + stuck peering
after a disk failure

pg 12.7a1 query: https://gist.github.com/benh57/ba4f96103e1f6b3b7a4d
pg 12.7b query: https://gist.github.com/benh57/8db0bfccc5992b9ca71a
pg 10.4f query:  https://gist.github.com/benh57/44bdd2a19ea667d920ab
ceph osd tree: https://gist.github.com/benh57/9fc46051a0f09b6948b7

- The bad OSD (osd-73) was on mtl-024. There were no 'unfound' objects when
it went down, the pg was 'down + peering'. It was marked lost.
- After marking 73 lost, the new primary still wants to peer and flips
between peering and incomplete.
- Noticed '73' still shows in the pg query output for the bad pgs. (maybe I
need to bring back an osd with the same name?)
- Noticed that the new primary got set to an osd (osd-77) which was on the
same node as (osd-76) which had all the data.  Figuring 77 couldn't peer
with 36 because it was on the same node, i set 77 out, 36 became primary
and 76 became one of the replicas. No change.

startup logs of Primaries of bad pgs (12.7a1, 10.4f) with 'debug osd = 20,
debug filestore = 30, debug ms = 1'  (large files)

osd 36 (12.7a1) startup 
log:https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.36.log
osd 6 (10.4f) startup 
log:https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.6.log


Some other Notes:

- Searching for OSDs which had data in 12.7a1_head, I found that osd-76 has
12G, but primary osd-36 has 728M. Another OSD which is out (100) also has a
copy of the data.  Even after running a pg repair does not pick up the data
from 76, remains stuck peering

- One of the pgs was part of a pool which was no longer needed. (the unused
radosgw .rgw.control pool, with one 0kb object in it) Per previous steps
discussed here for a similar failure, I attempted these recovery steps on
it, to see if they would work for the others:

-- The failed osd disk only mounts 'read only' which causes
ceph-objectstore-tool to fail to export, so i exported it from a seemingly
good copy on another osd.
-- stopped all osds
-- exported the pg with objectstore-tool from an apparently good OSD
-- removed the pg from all osds which had it using objectstore-tool
-- imported the pg into an out osd, osd-100

   Importing pgid 4.95
Write 4/88aa5c95/notify.2/head
Import successful

-- Force recreated the pg on the cluster:
ceph pg force_create_pg 4.95
-- brought up all osds
-- new pg 4.95 primary gets set to osd-99 + osd-64, 0 objects

However, the object doesn't sync to the pg from osd-100, and instead osd.64
tells osd-100 to remove it:

2016-03-05 15:44:22.858147 7fc004168700 20 osd.100 68025 _dispatch
0x7fc020867660 osd pg remove(epoch 68025; pg4.95; ) v2
2016-03-05 15:44:22.858174 7fc004168700  7 osd.100 68025 handle_pg_remove
from osd.64 on 1 pgs
2016-03-05 15:44:22.858176 7fc004168700 15 osd.100 68025
require_same_or_newer_map 68025 (i am 68025) 0x7fc020867660
2016-03-05 15:44:22.858188 7fc004168700  5 osd.100 68025
queue_pg_for_deletion: 4.95
2016-03-05

Re: [ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-08 Thread David Zafman


Ben,

I haven't looked at everything in your message, but pg 12.7a1 has lost 
data because of writes that went only to osd.73.  The way to recover 
this is to force recovery to ignore this fact and go with whatever data 
you have on the remaining OSDs.
I assume this was caused by having min_size 1, multiple nodes failing 
while clients continued to write, and then permanently losing osd.73.


You should TEMPORARILY set osd_find_best_info_ignore_history_les config 
variable to 1 on osd.36 and then mark it down (ceph osd down), so it 
will rejoin, re-peer and mark the pg active+clean. Don't forget to set 
osd_find_best_info_ignore_history_les

back to 0.


Later you should fix your crush map.  See 
http://docs.ceph.com/docs/master/rados/operations/crush-map/


The wrong placements make you vulnerable to a single host failure 
taking out multiple copies of an object.


David

On 3/7/16 9:41 PM, Ben Hines wrote:

Howdy,

I was hoping someone could help me recover a couple pgs which are causing
problems in my cluster. If we aren't able to resolve this soon, we may have
to just destroy them and lose some data. Recovery has so far been
unsuccessful. Data loss would probably cause some here to reconsider Ceph
as something we'll stick with long term, so I'd love to recover it.

Ceph 9.2.1. I have 4 (well, 3 now) pgs which are incomplete + stuck peering
after a disk failure

pg 12.7a1 query: https://gist.github.com/benh57/ba4f96103e1f6b3b7a4d
pg 12.7b query: https://gist.github.com/benh57/8db0bfccc5992b9ca71a
pg 10.4f query:  https://gist.github.com/benh57/44bdd2a19ea667d920ab
ceph osd tree: https://gist.github.com/benh57/9fc46051a0f09b6948b7

- The bad OSD (osd-73) was on mtl-024. There were no 'unfound' objects when
it went down, the pg was 'down + peering'. It was marked lost.
- After marking 73 lost, the new primary still wants to peer and flips
between peering and incomplete.
- Noticed '73' still shows in the pg query output for the bad pgs. (maybe I
need to bring back an osd with the same name?)
- Noticed that the new primary got set to an osd (osd-77) which was on the
same node as (osd-76) which had all the data.  Figuring 77 couldn't peer
with 36 because it was on the same node, i set 77 out, 36 became primary
and 76 became one of the replicas. No change.

startup logs of Primaries of bad pgs (12.7a1, 10.4f) with 'debug osd = 20,
debug filestore = 30, debug ms = 1'  (large files)

osd 36 (12.7a1) startup log:
https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.36.log
osd 6 (10.4f) startup log:
https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.6.log


Some other Notes:

- Searching for OSDs which had data in 12.7a1_head, I found that osd-76 has
12G, but primary osd-36 has 728M. Another OSD which is out (100) also has a
copy of the data.  Even after running a pg repair does not pick up the data
from 76, remains stuck peering

- One of the pgs was part of a pool which was no longer needed. (the unused
radosgw .rgw.control pool, with one 0kb object in it) Per previous steps
discussed here for a similar failure, I attempted these recovery steps on
it, to see if they would work for the others:

-- The failed osd disk only mounts 'read only' which causes
ceph-objectstore-tool to fail to export, so i exported it from a seemingly
good copy on another osd.
-- stopped all osds
-- exported the pg with objectstore-tool from an apparently good OSD
-- removed the pg from all osds which had it using objectstore-tool
-- imported the pg into an out osd, osd-100

   Importing pgid 4.95
Write 4/88aa5c95/notify.2/head
Import successful

-- Force recreated the pg on the cluster:
ceph pg force_create_pg 4.95
-- brought up all osds
-- new pg 4.95 primary gets set to osd-99 + osd-64, 0 objects

However, the object doesn't sync to the pg from osd-100, and instead osd.64
tells osd-100 to remove it:

2016-03-05 15:44:22.858147 7fc004168700 20 osd.100 68025 _dispatch
0x7fc020867660 osd pg remove(epoch 68025; pg4.95; ) v2
2016-03-05 15:44:22.858174 7fc004168700  7 osd.100 68025 handle_pg_remove
from osd.64 on 1 pgs
2016-03-05 15:44:22.858176 7fc004168700 15 osd.100 68025
require_same_or_newer_map 68025 (i am 68025) 0x7fc020867660
2016-03-05 15:44:22.858188 7fc004168700  5 osd.100 68025
queue_pg_for_deletion: 4.95
2016-03-05 15:44:22.858228 7fc004168700 15 osd.100 68025 project_pg_history
4.95 from 68025 to 68025, start ec=76 les/c/f 62655/62611/0
66982/67983/66982

Not wanting this to happen to my needed data from the other PGs, I didn't
try this procedure with those PGs. After this procedure osd-100 does get
listed in 'pg query' as 'might_have_unfound', but ceph apparently decides
not to use it and the active osd sends a remove.

output of 'ceph pg 4.95 query' after these recovery steps:
https://gist.github.com/benh57/fc9a847cd83f4d5e4dcf


Quite Possibly Related:

I am occasionally noticing some incorrectness in 'ceph osd tree'. It seems
my crush map thinks some osds are on 

Re: [ceph-users] 答复: How long will the logs be kept?

2015-12-07 Thread David Zafman


dout() is used for an OSD to log information about what it is doing 
locally and might become very chatty.  It is saved on the local node's 
disk only.


clog is the cluster log and is used for major events that should be 
known by the administrator (see ceph -w).  Clog should be used sparingly 
as it sends the messages to the monitor.
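
A quick way to see both in practice, as a sketch (assuming osd.0 and the default log path):

$ ceph -w                                        # follow the cluster log (clog)
$ ceph tell osd.0 injectargs '--debug-osd 20'    # make dout() verbose for osd.0
$ tail -f /var/log/ceph/ceph-osd.0.log           # dout() output lands in the local log file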


David

On 12/3/15 4:36 AM, Wukongming wrote:

OK! One more question: do you know why ceph has two ways of outputting logs (dout && 
clog)? I find dout more helpful than clog. Did ceph use clog first, with dout 
added in a newer version?

-
wukongming ID: 12019
Tel:0571-86760239
Dept:2014 UIS2 ONEStor

-----Original Message-----
From: Jan Schermer [mailto:j...@schermer.cz]
Sent: December 3, 2015 16:58
To: wukongming 12019 (RD)
Cc: huang jun; ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How long will the logs be kept?

You can set up logrotate however you want - not sure what the default is for 
your distro.
Usually logrotate doesn't touch files that are smaller than some size even if 
they are old. It will also not delete logs for OSDs that no longer exist.

Ceph itself has nothing to do with log rotation; logrotate does the work. Ceph 
packages likely contain default logrotate rules for the logs but you can edit 
them to your liking.

Jan


On 03 Dec 2015, at 09:38, Wukongming  wrote:

Yes, I can find the ceph logrotate config file in the directory 
/etc/logrotate.d.
Also, I noticed something weird.

drwxr-xr-x  2 root root   4.0K Dec  3 14:54 ./
drwxrwxr-x 19 root syslog 4.0K Dec  3 13:33 ../
-rw---  1 root root  0 Dec  2 06:25 ceph.audit.log
-rw---  1 root root85K Nov 25 09:17 ceph.audit.log.1.gz
-rw---  1 root root   228K Dec  3 16:00 ceph.log
-rw---  1 root root28K Dec  3 06:23 ceph.log.1.gz
-rw---  1 root root   374K Dec  2 06:22 ceph.log.2.gz
-rw-r--r--  1 root root   4.3M Dec  3 16:01 ceph-mon.wkm01.log
-rw-r--r--  1 root root   561K Dec  3 06:25 ceph-mon.wkm01.log.1.gz
-rw-r--r--  1 root root   2.2M Dec  2 06:25 ceph-mon.wkm01.log.2.gz
-rw-r--r--  1 root root  0 Dec  2 06:25 ceph-osd.0.log
-rw-r--r--  1 root root992 Dec  1 09:09 ceph-osd.0.log.1.gz
-rw-r--r--  1 root root19K Dec  3 10:51 ceph-osd.2.log
-rw-r--r--  1 root root   2.3K Dec  2 10:50 ceph-osd.2.log.1.gz
-rw-r--r--  1 root root27K Dec  1 10:31 ceph-osd.2.log.2.gz
-rw-r--r--  1 root root13K Dec  3 10:23 ceph-osd.5.log
-rw-r--r--  1 root root   1.6K Dec  2 09:57 ceph-osd.5.log.1.gz
-rw-r--r--  1 root root22K Dec  1 09:51 ceph-osd.5.log.2.gz
-rw-r--r--  1 root root19K Dec  3 10:51 ceph-osd.8.log
-rw-r--r--  1 root root18K Dec  2 10:50 ceph-osd.8.log.1
-rw-r--r--  1 root root   261K Dec  1 13:54 ceph-osd.8.log.2

I deployed the ceph cluster on Nov 21. The logs from that day through Dec 1, 
the whole 10 days, were compressed into one file, which is not what I want.
Does any operation affect log compression?

Thanks!
Kongming Wu
-
wukongming ID: 12019
Tel:0571-86760239
Dept:2014 UIS2 ONEStor

-----Original Message-----
From: huang jun [mailto:hjwsm1...@gmail.com]
Sent: December 3, 2015 13:19
To: wukongming 12019 (RD)
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: Re: How long will the logs be kept?

It will rotate every week by default; you can see the logrotate file at
/etc/logrotate.d/ceph
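
To check or exercise those rules by hand, something along these lines should work (path assumed to be the packaged /etc/logrotate.d/ceph):

$ logrotate -d /etc/logrotate.d/ceph    # dry run, shows what would be rotated
$ logrotate -f /etc/logrotate.d/ceph    # force a rotation right now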

2015-12-03 12:37 GMT+08:00 Wukongming :

Hi, all
Does anyone know how long, or for how many days, the logs.gz files 
(mon/osd/mds) are kept before they are flushed?

-
wukongming ID: 12019
Tel:0571-86760239
Dept:2014 UIS2 OneStor

-------------------------------------------------------------------
This e-mail and its attachments contain confidential information from
H3C, which is intended only for the person or entity whose address is
listed above. Any use of the information contained herein in any way
(including, but not limited to, total or partial disclosure,
reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error,
please notify the sender by phone or email immediately and delete it!



--
thanks
huangjun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Core dump when running OSD service

2015-10-22 Thread David Zafman


I was focused on fixing the OSD, but you need to determine if some 
misconfiguration or hardware issue caused a filesystem corruption.
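
As a rough starting point for that check (the mount log shows an ext magic, 0xef53, so e2fsck applies; /dev/sdX1 is a placeholder for the OSD's data partition, which must be unmounted first):

# dmesg | grep -iE 'ext4|error'    # look for kernel-reported disk or fs errors
# umount /var/lib/ceph/osd/ceph-1
# e2fsck -fn /dev/sdX1             # -n = read-only check, report but don't fix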


David

On 10/22/15 3:08 PM, David Zafman wrote:


There is a corruption of the osdmaps on this particular OSD.  You need 
to determine which maps are bad, probably by bumping the osd debug level 
to 20.  Then transfer them from a working OSD.  The newest 
ceph-objectstore-tool has features to write the maps, but you'll need 
to build a version based on a v0.94.4 source tree.  I don't know if 
you can just copy files with names like 
"current/meta/osdmap.8__0_FD6E4D61__none" (map for epoch 8) between OSDs.


David

On 10/21/15 8:54 PM, James O'Neill wrote:
I have an OSD that didn't come up after a reboot. I was getting the 
error shown below. It was running 0.94.3 so I reinstalled all 
packages. I then upgraded everything to 0.94.4 hoping that would fix 
it but it hasn't. There are three OSDs, this is the only one having 
problems (it also contains the inconsistent pgs). Can anyone tell me 
what the problem might be?



root@dbp-ceph03:/srv/data# ceph status
   cluster 4f6fb784-bd17-4105-a689-e8d1b4bc5643
health HEALTH_ERR
   53 pgs inconsistent
   542 pgs stale
   542 pgs stuck stale
   5 requests are blocked > 32 sec
   85 scrub errors
   too many PGs per OSD (544 > max 300)
   noout flag(s) set
monmap e3: 3 mons at 
{dbp-ceph01=172.17.241.161:6789/0,dbp-ceph02=172.17.241.162:6789/0,dbp-ceph03=172.17.241.163:6789/0}
   election epoch 52, quorum 0,1,2 
dbp-ceph01,dbp-ceph02,dbp-ceph03

osdmap e107: 2 osds: 2 up, 2 in
   flags noout
 pgmap v65678: 1088 pgs, 9 pools, 55199 kB data, 173 objects
   2265 MB used, 16580 MB / 19901 MB avail
546 active+clean
489 stale+active+clean
 53 stale+active+clean+inconsistent


root@dbp-ceph02:~# /usr/bin/ceph-osd --cluster=ceph -i 1 -d
2015-10-22 14:15:48.312507 7f4edabec900 0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process ceph-osd, pid 31215
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2015-10-22 14:15:48.352013 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) backend generic (magic 0xef53)
2015-10-22 14:15:48.355621 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is supported and appears to work
2015-10-22 14:15:48.355655 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-10-22 14:15:48.362016 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2015-10-22 14:15:48.372819 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) limited size xattrs
2015-10-22 14:15:48.387002 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD 
journal mode: checkpoint is not enabled
2015-10-22 14:15:48.394002 7f4edabec900 -1 journal 
FileJournal::_open: disabling aio for non-block journal. Use 
journal_force_aio to force use of aio anyway
2015-10-22 14:15:48.397803 7f4edabec900 0  
cls/hello/cls_hello.cc:271: loading cls_hello
terminate called after throwing an instance of 
'ceph::buffer::end_of_buffer'

 what(): buffer::end_of_buffer
*** Caught signal (Aborted) **
in thread 7f4edabec900
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: (__libc_start_main()+0xf5) [0x7f4ed7d2aec5]
17: /usr/bin/ceph-osd() [0x66b887]
2015-10-22 14:15:48.412520 7f4edabec900 -1 *** Caught signal 
(Aborted) **

in thread 7f4edabec900

ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer

Re: [ceph-users] Core dump when running OSD service

2015-10-22 Thread David Zafman


There is a corruption of the osdmaps on this particular OSD.  You need 
to determine which maps are bad, probably by bumping the osd debug level to 
20.  Then transfer them from a working OSD.  The newest 
ceph-objectstore-tool has features to write the maps, but you'll need to 
build a version based on a v0.94.4 source tree.  I don't know if you can 
just copy files with names like 
"current/meta/osdmap.8__0_FD6E4D61__none" (map for epoch 8) between OSDs.


David

On 10/21/15 8:54 PM, James O'Neill wrote:
I have an OSD that didn't come up after a reboot. I was getting the 
error shown below. It was running 0.94.3 so I reinstalled all packages. 
I then upgraded everything to 0.94.4 hoping that would fix it but it 
hasn't. There are three OSDs, this is the only one having problems (it 
also contains the inconsistent pgs). Can anyone tell me what the 
problem might be?



root@dbp-ceph03:/srv/data# ceph status
   cluster 4f6fb784-bd17-4105-a689-e8d1b4bc5643
health HEALTH_ERR
   53 pgs inconsistent
   542 pgs stale
   542 pgs stuck stale
   5 requests are blocked > 32 sec
   85 scrub errors
   too many PGs per OSD (544 > max 300)
   noout flag(s) set
monmap e3: 3 mons at 
{dbp-ceph01=172.17.241.161:6789/0,dbp-ceph02=172.17.241.162:6789/0,dbp-ceph03=172.17.241.163:6789/0}
   election epoch 52, quorum 0,1,2 
dbp-ceph01,dbp-ceph02,dbp-ceph03

osdmap e107: 2 osds: 2 up, 2 in
   flags noout
 pgmap v65678: 1088 pgs, 9 pools, 55199 kB data, 173 objects
   2265 MB used, 16580 MB / 19901 MB avail
546 active+clean
489 stale+active+clean
 53 stale+active+clean+inconsistent


root@dbp-ceph02:~# /usr/bin/ceph-osd --cluster=ceph -i 1 -d
2015-10-22 14:15:48.312507 7f4edabec900 0 ceph version 0.94.4 
(95292699291242794510b39ffde3f4df67898d3a), process ceph-osd, pid 31215
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2015-10-22 14:15:48.352013 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) backend generic (magic 0xef53)
2015-10-22 14:15:48.355621 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is supported and appears to work
2015-10-22 14:15:48.355655 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-10-22 14:15:48.362016 7f4edabec900 0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2015-10-22 14:15:48.372819 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) limited size xattrs
2015-10-22 14:15:48.387002 7f4edabec900 0 
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal 
mode: checkpoint is not enabled
2015-10-22 14:15:48.394002 7f4edabec900 -1 journal FileJournal::_open: 
disabling aio for non-block journal. Use journal_force_aio to force 
use of aio anyway
2015-10-22 14:15:48.397803 7f4edabec900 0  
cls/hello/cls_hello.cc:271: loading cls_hello
terminate called after throwing an instance of 
'ceph::buffer::end_of_buffer'

 what(): buffer::end_of_buffer
*** Caught signal (Aborted) **
in thread 7f4edabec900
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: (__libc_start_main()+0xf5) [0x7f4ed7d2aec5]
17: /usr/bin/ceph-osd() [0x66b887]
2015-10-22 14:15:48.412520 7f4edabec900 -1 *** Caught signal (Aborted) **
in thread 7f4edabec900

ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
1: /usr/bin/ceph-osd() [0xacd94a]
2: (()+0x10340) [0x7f4ed98a1340]
3: (gsignal()+0x39) [0x7f4ed7d3fcc9]
4: (abort()+0x148) [0x7f4ed7d430d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f4ed864b6b5]
6: (()+0x5e836) [0x7f4ed8649836]
7: (()+0x5e863) [0x7f4ed8649863]
8: (()+0x5eaa2) [0x7f4ed8649aa2]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) 
[0xc35ef7]

10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x6d) [0xb834ed]
11: (OSDMap::decode(ceph::buffer::list&)+0x3f) [0xb8560f]
12: (OSDService::try_get_map(unsigned int)+0x530) [0x6ac2c0]
13: (OSDService::get_map(unsigned int)+0xe) [0x70ad2e]
14: (OSD::init()+0x6ad) [0x6c5e0d]
15: (main()+0x2860) [0x6527e0]
16: 

Re: [ceph-users] CephFS file to rados object mapping

2015-10-21 Thread David Zafman


See below

On 10/21/15 2:44 PM, Gregory Farnum wrote:

On Wed, Oct 14, 2015 at 7:20 PM, Francois Lafont  wrote:

Hi,

On 14/10/2015 06:45, Gregory Farnum wrote:


Ok, however during my tests I had been careful to replace the correct
file with a bad file of *exactly* the same size (the content of the
file was just a little string and I changed it to a string of
exactly the same size). I had been careful to undo the mtime update
too (I had restored the mtime of the file before the change). Despite
this, the "repair" command worked well. Tested twice: 1. with the change
on the primary OSD and 2. on the secondary OSD. And I was surprised
because I thought test 1 (on the primary OSD) would fail.

Hm. I'm a little confused by that, actually. Exactly what was the path
to the files you changed, and do you have before-and-after comparisons
on the content and metadata?

I didn't remember exactly the process I had used, so I have just retried
it today. Here is my process: I have a healthy cluster with 3 nodes (Ubuntu
Trusty) running ceph Hammer (version 0.94.3). I have mounted cephfs on
/mnt on one of the nodes.

~# cat /mnt/file.txt # yes it's a little file. ;)
123456

~# ls -i /mnt/file.txt
1099511627776 /mnt/file.txt

~# printf "%x\n" 1099511627776
100

~# rados -p data ls - | grep 100
100.

I have the name of the object mapped to my "file.txt".

~# ceph osd map data 100.
osdmap e76 pool 'data' (3) object '100.' -> pg 3.f0b56f30 (3.30) 
-> up ([1,2], p1) acting ([1,2], p1)

So my object is in the primary OSD OSD-1 and in the secondary OSD OSD-2.
So I open a terminal in the node which hosts the primary OSD OSD-1 and
then:

~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
123456

~# ll 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3

Now, I change the content with this script called "change_content.sh" to
preserve the mtime after the change:

-
#!/bin/sh

f="$1"
f_tmp="${f}.tmp"
content="$2"
cp --preserve=all "$f" "$f_tmp"
echo "$content" >"$f"
touch -r "$f_tmp" "$f" # to restore the mtime after the change
rm "$f_tmp"
-

So, let's go: I replace the content with new content of exactly
the same size (i.e. "ABCDEF" in this example):

~# ./change_content.sh 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
 ABCDEF

~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
ABCDEF

~# ll 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3

Now, the secondary OSD contains the good version of the object and
the primary a bad version. Now, I launch a "ceph pg repair":

~# ceph pg repair 3.30
instructing pg 3.30 on osd.1 to repair

# I'm in the primary OSD and the file below has been repaired correctly.
~# cat 
/var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3
123456

As you can see, the repair command has worked well.
Maybe my little test is too trivial?

Hmm, maybe David has some idea.


As of the Hammer release, a replicated object that is written 
sequentially maintains a CRC of the entire object.  This CRC costs no 
extra I/O to maintain and is saved with other object information like 
size and mtime.   So in your test the bad replica is identified by 
comparing the CRC of what is read off disk with the value in the object info.
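
To see that comparison exercised on the pg from the example above (pg 3.30 taken from that output; nothing here is destructive except the repair itself):

$ ceph pg deep-scrub 3.30    # re-reads every replica and compares checksums/digests
$ ceph -w                    # watch the cluster log for the scrub error report
$ ceph pg repair 3.30        # copies the authoritative version over the bad replica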


David

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] O_DIRECT on deep-scrub read

2015-10-07 Thread David Zafman


There would be a benefit to doing fadvise POSIX_FADV_DONTNEED after 
deep-scrub reads for objects not recently accessed by clients.


I see the NewStore objectstore sometimes using the O_DIRECT flag for 
writes.  This concerns me because the open(2) man page says:


"Applications should avoid mixing O_DIRECT and normal I/O to the same 
file, and especially to overlapping byte regions in the same file.  Even 
when the filesystem correctly handles the coherency issues in this 
situation, overall I/O throughput is likely to be slower than using 
either mode alone."


David

On 10/7/15 7:50 AM, Sage Weil wrote:

It's not, but it would not be hard to do this.  There are fadvise-style
hints being passed down that could trigger O_DIRECT reads in this case.
That may not be the best choice, though--it won't use data that happens
to be in cache and it'll also throw it out.

On Wed, 7 Oct 2015, Paweł Sadowski wrote:


Hi,

Can anyone tell if deep scrub is done using the O_DIRECT flag or not? I'm
not able to verify that in the source code.

If not, would it be possible to add such a feature (maybe a config option) to
help keep the Linux page cache in better shape?

Thanks,

--
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD respawning -- FAILED assert(clone_size.count(clone))

2015-09-08 Thread David Zafman


Chris,

I was wondering, if you still have /tmp/snap.out lying around, could you 
send it to me?   The way the dump-to-json code works, if "clones" is 
empty it doesn't show me what the two other structures look like.


David

On 9/5/15 3:24 PM, Chris Taylor wrote:

# ceph-dencoder type SnapSet import /tmp/snap.out decode dump_json
{
"snap_context": {
"seq": 9197,
"snaps": [
9197
]
},
"head_exists": 1,
    "clones": []
}


On 09/03/2015 04:48 PM, David Zafman wrote:


If you have ceph-dencoder installed or can build v0.94.3 to get the 
binary, you can dump the SnapSet for the problem object. Once you 
understand the removal procedure you could do the following to get a 
look at the SnapSet information.


Find the object from --op list with snapid -2 and cut and paste that 
json into the following command


Something like:
$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 
' get-attr snapset > /tmp/snap.out


$ ceph-dencoder type SnapSet import /tmp/snap.out decode dump_json


{
"snap_context": {
"seq": 4,
"snaps": [
4,
3,
2,
1
]
},
"head_exists": 1,
"clones": [
{
"snap": 1,
"size": 1032,
"overlap": "[]"
},
{
"snap": 2,
    "size": 452,
"overlap": "[]"
},
{
"snap": 3,
"size": 452,
"overlap": "[]"
},
{
"snap": 4,
"size": 452,
"overlap": "[]"
}
]
}

On 9/3/15 2:44 PM, David Zafman wrote:


Chris,

WARNING: Do this at your own risk.  You are deleting one of the 
snapshots of a specific portion of an rbd image.  I'm not sure how 
rbd will react.  Maybe you should repair the SnapSet instead of 
removing the inconsistency.   However, as far as I know there isn't a 
tool for it.


If you are able to build from Ceph source, I happen to have an 
enhancement to ceph-objectstore-tool to output the SnapSet.


---

The message preceding the assert is in the same thread so " 
rb.0.8c2990.238e1f29.8cc0/23ed//3" has the object name in 
it.  The 23ed is the RADOS clone/snap ID.


First, get a backup by exporting the pg using the 
ceph-objectstore-tool.  Specify a --file somewhere with enough 
disk space.


$ ceph-objectstore-tool --data-path xx --journal-path xx 
--op export --pgid 3.f9 --file destination

Exporting 3.f9

Read 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

Export successful

Now you need the JSON of the object in question.  The 3rd line of 
output has the snapid 9197, which is 23ed in hex.


$ ceph-objectstore-tool --data-path xx --journal-path xx 
--op list rb.0.8c2990.238e1f29.8cc0



["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9196,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid",9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9198,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 



To remove it, cut and paste your output line with snapid 9197 inside 
single quotes like this:


$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]' 
remove



remove 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

To get all the OSDs to boot you'll have to do the remove on all OSDs 
that contain this PG and have an entry with snapid 9197 for this 
object.

Re: [ceph-users] OSD respawning -- FAILED assert(clone_size.count(clone))

2015-09-04 Thread David Zafman


Chris,

I see that you have stack traces that indicate some OSDs are running 
v0.94.2 (osd.23) and some running v0.94.3 (osd.30).  They should be 
running the same release except briefly while upgrading.  I see some 
snapshot/cache tiering fixes went into 0.94.3.  So an OSD running 
v0.94.2 when you enabled cache tiering may have been the root cause of 
the SnapSet issue.  Once that has occurred, an OSD of any version can 
crash because the bad SnapSet gets replicated.
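
A quick way to confirm which release each daemon is actually running:

$ ceph tell osd.23 version
$ ceph tell osd.30 version
$ ceph tell osd.* version    # or ask all OSDs at once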


I'd love to see the SnapSet per my ceph-dencoder instructions in a prior 
e-mail.  This would help me verify the root cause.


See my inlined comments, but to bring it all together:

1. Fix osd.10 by removing the extra clone as you did on osd.30
2. If you can, get me the ceph-dencoder output of the bad SnapSet
3. Verify the cluster is stable and run a scrub on pg 3.f9, and preferably 
on all pool 3 PGs

4. Delete the old rbd pool (pool 3) and create a new one
5. Restore the RBD images from backup using the new pool (make sure you have 
enough disk space, as the pool delete removes objects asynchronously); see the 
command sketch below
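
For steps 4 and 5, a rough command sketch (pool 3 is assumed to be named 'rbd'; the pg count and image/backup names are placeholders to adapt):

$ ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
$ ceph osd pool create rbd 512                   # pick a pg_num appropriate for your cluster
$ rbd import /backups/myimage.img rbd/myimage    # or restore with rbd import-diff on a fresh base image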


David

On 9/3/15 8:15 PM, Chris Taylor wrote:

On 09/03/2015 02:44 PM, David Zafman wrote:


Chris,

WARNING: Do this at your own risk.  You are deleting one of the 
snapshots of a specific portion of an rbd image.  I'm not sure how 
rbd will react.  Maybe you should repair the SnapSet instead of 
removing the inconsistency.   However, as far as I know there isn't a 
tool for it.



Would removing all the snapshots of an RBD image fix the SnapSet?
Now I'm not sure you can remove the images without causing a crash until 
you can scrub.  Fix osd.10 as indicated below.




If I remove the RBD image and re-import from backup with "rbd 
import-diff ..." will that fix it?
Once you have a stable cluster and can scrub this PG and probably all 
pool 3 PGs, then out of an abundance of caution, I would delete the 
pool, create a new one and restore the RBD images from backup.


If you are able to build from Ceph source, I happen to have an 
enhancement to ceph-objectstore-tool to output the SnapSet.


---

The message preceding the assert is in the same thread so " 
rb.0.8c2990.238e1f29.8cc0/23ed//3" has the object name in 
it.  The 23ed is the RADOS clone/snap ID.


First, get a backup by exporting the pg using the 
ceph-objectstore-tool.  Specify a --file somewhere with enough 
disk space.


$ ceph-objectstore-tool --data-path xx --journal-path xx --op 
export --pgid 3.f9 --file destination

Exporting 3.f9

Read 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

Export successful

I was able to export the PG.


Now you need the JSON of the object in question.  The 3rd line of 
output has the snapid 9197, which is 23ed in hex.


$ ceph-objectstore-tool --data-path xx --journal-path xx --op 
list rb.0.8c2990.238e1f29.8cc0



["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9196,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid",9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9198,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 



To remove it, cut and paste your output line with snapid 9197 inside 
single quotes like this:


$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]' 
remove



remove 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

I removed the object. After starting the OSD I'm now getting an error 
that a shard is missing and the OSD crashes.


   -4> 2015-09-03 20:11:52.471741 7fdc0d42c700  2 osd.30 pg_epoch: 
231748 pg[3.f9( v 231748'10799542 (222304'10796422,231748'10799542] 
local-les=231748 n=11658 ec=101 les/c 231748/231748 
231747/231747/231747) [30,10] r=0 lpr=231747 lua=231699'10799538 
crt=231699'10799538 lcod 231748'10799541 mlcod 0'0 
active+clean+scrubbing+deep] scrub_compare_maps   osd.30 has 25 items
-3> 2015-09-03 20:11:52.471772 7fdc0d42c700  2 osd.30 pg_epoch: 
231748 pg[3.f9( v 231748'10799542 (222304'10796422,231748'10799542] 
local-les=2

Re: [ceph-users] OSD respawning -- FAILED assert(clone_size.count(clone))

2015-09-03 Thread David Zafman


This crash is what happens if a clone is missing from the SnapSet (internal 
data) for an object in the ObjectStore.  If you had out-of-space issues, 
this could possibly have been caused by being able to rename or create 
files in a directory, but not being able to update the SnapSet.


I've completely rewritten that logic so scrub doesn't crash, but it 
hasn't been in a release yet.  In the future scrub will just report an 
unexpected clone in the ObjectStore.


You'll need to find and remove the extraneous clone.   Bump the "debug 
osd" to 20 so that you'll get the name of the object in the log.  Start 
the OSD and, after it crashes, examine the log.  Then remove the extraneous 
object using ceph-objectstore-tool.  You'll have to repeat this process 
if there are more of these.
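
A sketch of the debug bump, assuming osd.30 and a foreground start (putting "debug osd = 20" in the [osd] section of ceph.conf and restarting works just as well):

# ceph-osd --cluster=ceph -i 30 -d --debug-osd 20 2>&1 | tee /tmp/ceph-osd.30.debug
# grep -B5 'FAILED assert(clone_size.count(clone))' /tmp/ceph-osd.30.debug   # the object name is just above the assert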


David

On 9/3/15 2:22 AM, Gregory Farnum wrote:

On Thu, Sep 3, 2015 at 7:48 AM, Chris Taylor  wrote:

I removed the latest OSD that was respawning (osd.23) and now I'm having the
same problem with osd.30. It looks like they both have pg 3.f9 in common. I
tried "ceph pg repair 3.f9" but the OSD is still respawning.

Does anyone have any ideas?

This is definitely something wrong with the disk state on the OSD. Are
you using cache pools or any other features that aren't so standard?
What workload is running against this cluster? How are the OSDs
configured (btrfs/xfs/ext4, on single hard drives, etc)?
David might have some ideas what's gone wrong or what else besides the
log tail would be needed to diagnose.
-Greg


Thanks,
Chris


ceph-osd-03:ceph-osd.30.log

-29> 2015-09-02 23:31:44.041181 7fbd1bf01700  0 log_channel(cluster) log
[INF] : 3.f9 deep-scrub starts
-28> 2015-09-02 23:31:44.042339 7fbd1bf01700  1 -- 10.21.0.23:6824/3512
--> 10.21.0.22:6800/3623 -- replica scrub(pg:
3.f9,from:0'0,to:210364'10793453,epoch:222344,start:0//0//-1,end:1ff300f9//0//-1,chunky:1,deep:1,seed:4294967295,version:6)
v6 -- ?+0 0x19906f80 con 0x186714a0
-27> 2015-09-02 23:31:44.055769 7fbd27718700  1 -- 10.20.0.23:6825/3512
<== osd.43 10.20.0.21:0/3850 1  osd_ping(ping e222344 stamp 2015-09-02
23:31:44.055321) v2  47+0+0 (2626217624 0 0) 0x4b43800 con 0x19f466e0
-26> 2015-09-02 23:31:44.055805 7fbd27718700  1 -- 10.20.0.23:6825/3512
--> 10.20.0.21:0/3850 -- osd_ping(ping_reply e222344 stamp 2015-09-02
23:31:44.055321) v2 -- ?+0 0x19adba00 con 0x19f466e0
-25> 2015-09-02 23:31:44.056016 7fbd25f15700  1 -- 10.21.0.23:6825/3512
<== osd.43 10.20.0.21:0/3850 1  osd_ping(ping e222344 stamp 2015-09-02
23:31:44.055321) v2  47+0+0 (2626217624 0 0) 0x4a72200 con 0x19f46160
-24> 2015-09-02 23:31:44.056043 7fbd25f15700  1 -- 10.21.0.23:6825/3512
--> 10.20.0.21:0/3850 -- osd_ping(ping_reply e222344 stamp 2015-09-02
23:31:44.055321) v2 -- ?+0 0x1b9b4000 con 0x19f46160
-23> 2015-09-02 23:31:44.111037 7fbd177f2700  1 -- 10.21.0.23:6824/3512
<== osd.10 10.21.0.22:6800/3623 28  osd_sub_op(unknown.0.0:0 3.f9
0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11  1226+0+13977
(2839644439 0 129553580) 0x19a3f700 con 0x186714a0
-22> 2015-09-02 23:31:44.111071 7fbd177f2700  5 -- op tracker -- seq:
354, time: 2015-09-02 23:31:44.110934, event: header_read, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-21> 2015-09-02 23:31:44.111079 7fbd177f2700  5 -- op tracker -- seq:
354, time: 2015-09-02 23:31:44.110936, event: throttled, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-20> 2015-09-02 23:31:44.111085 7fbd177f2700  5 -- op tracker -- seq:
354, time: 2015-09-02 23:31:44.111028, event: all_read, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-19> 2015-09-02 23:31:44.111090 7fbd177f2700  5 -- op tracker -- seq:
354, time: 0.00, event: dispatched, op: osd_sub_op(unknown.0.0:0 3.f9
0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
-18> 2015-09-02 23:31:44.42 7fbd1f708700  5 -- op tracker -- seq:
354, time: 2015-09-02 23:31:44.42, event: reached_pg, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-17> 2015-09-02 23:31:44.67 7fbd1f708700  5 -- op tracker -- seq:
354, time: 2015-09-02 23:31:44.67, event: started, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-16> 2015-09-02 23:31:44.111262 7fbd1f708700  5 -- op tracker -- seq:
354, time: 2015-09-02 23:31:44.111262, event: done, op:
osd_sub_op(unknown.0.0:0 3.f9 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[]
snapc=0=[])
-15> 2015-09-02 23:31:44.111374 7fbd1bf01700  2 osd.30 pg_epoch: 222344
pg[3.f9( v 222344'10796467 (210364'10793453,222344'10796467]
local-les=222344 n=11597 ec=101 les/c 222344/222344 222343/222343/222343)
[30,10] r=0 lpr=222343 crt=222335'10796453 lcod 222344'10796466 mlcod
222344'10796466 active+clean+scrubbing+deep] scrub_compare_maps   osd.30 has
24 items
-14> 2015-09-02 

Re: [ceph-users] OSD respawning -- FAILED assert(clone_size.count(clone))

2015-09-03 Thread David Zafman


Chris,

WARNING: Do this at your own risk.  You are deleting one of the 
snapshots of a specific portion of an rbd image.  I'm not sure how rbd 
will react.  Maybe you should repair the SnapSet instead of removing the 
inconsistency.   However, as far as I know there isn't a tool for it.


If you are able to build from Ceph source, I happen to have an 
enhancement to ceph-objectstore-tool to output the SnapSet.


---

The message preceding the assert is in the same thread so " 
rb.0.8c2990.238e1f29.8cc0/23ed//3" has the object name in it.  
The 23ed is the RADOS clone/snap ID.


First, get a backup by exporting the pg using the ceph-objectstore-tool.  
Specify a --file somewhere with enough disk space.


$ ceph-objectstore-tool --data-path xx --journal-path xx --op 
export --pgid 3.f9 --file destination

Exporting 3.f9

Read 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

Export successful

Now you need the JSON of the object in question.  The 3rd line of output 
has the snapid 9197, which is 23ed in hex.


$ ceph-objectstore-tool --data-path xx --journal-path xx --op 
list rb.0.8c2990.238e1f29.8cc0



["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9196,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]
["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid",9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]
["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9198,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]
["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]

To remove it, cut and paste your output line with snapid 9197 inside 
single quotes like this:


$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]' 
remove



remove 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

To get all the OSDs to boot you'll have to do the remove on all OSDs 
that contain this PG and have an entry with snapid 9197 for this object.


David

On 9/3/15 11:29 AM, Chris Taylor wrote:

On 09/03/2015 10:20 AM, David Zafman wrote:


This crash is what happens if a clone is missing from SnapSet 
(internal data) for an object in the ObjectStore.  If you had out of 
space issues, this could possibly have been caused by being able to 
rename or create files in a directory, but not being able to update 
SnapSet.


I've completely rewritten that logic so scrub doesn't crash, but it 
hasn't been in a release yet.  In the future scrub will just report 
an unexpected clone in the ObjectStore.


You'll need to find and remove the extraneous clone.   Bump the 
"debug osd" to 20 so that you'll get the name of the object in the 
log.  Start an OSD and after it crashes examine the log. Then remove 
the extraneous object using ceph-objectstore-tool. You'll have to 
repeat this process if there are more of these.


David


I looked for an example of how to use the ceph-objectstore-tool aside 
from what was provided with "-h". I really don't know how to read the 
log output to get the object name. Can you please provide an example?


Here is the end of the log after bumping the OSD debug up to 20:

   -11> 2015-09-03 10:52:15.164310 7fb1f4afe700 10 osd.30 pg_epoch: 
227938 pg[3.f9( v 227870'10797633 (211962'10794623,227870'10797633] 
local-les=227938 n=11612 ec=101 les/c 227938/227938 
227937/227937/227937) [30,10] r=0 lpr=227937 crt=227623'10797625 lcod 
0'0 mlcod 0'0 active+clean+scrubbing+deep] be_select_auth_object: 
selecting osd 10 for obj 
1ee800f9/rb.0.4b777.2ae8944a.006cfd7e/head//3
   -10> 2015-09-03 10:52:15.164334 7fb1f4afe700 10 osd.30 pg_epoch: 
227938 pg[3.f9( v 227870'10797633 (211962'10794623,227870'10797633] 
local-les=227938 n=11612 ec=101 les/c 227938/227938 
227937/227937/227937) [30,10] r=0 lpr=227937 crt=227623'10797625 lcod 
0'0 mlcod 0'0 active+clean+scrubbing+deep] be_select_auth_object: 
selecting osd 30 for obj 
1ee800f9/rb.0.4b777.2ae8944a.006cfd7e/head//3
-9> 2015-09-03 10:52:15.164359 7fb1f4afe700 10 osd.30 pg_epoch: 
227938 pg[3.f9( v 227870'10797633 (211962'10794623,227870'10797633] 
local-les=227938 n=11612 ec=101 les/c 227938/227938 
2279

Re: [ceph-users] OSD respawning -- FAILED assert(clone_size.count(clone))

2015-09-03 Thread David Zafman


If you have ceph-dencoder installed or can build v0.94.3 to get the 
binary, you can dump the SnapSet for the problem object.   Once you 
understand the removal procedure you could do the following to get a 
look at the SnapSet information.


Find the object from --op list with snapid -2 and cut and paste that 
json into the following command


Something like:
$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 
' get-attr snapset > /tmp/snap.out


$ ceph-dencoder type SnapSet import /tmp/snap.out decode dump_json


{
"snap_context": {
"seq": 4,
"snaps": [
4,
3,
2,
1
]
},
"head_exists": 1,
"clones": [
{
"snap": 1,
"size": 1032,
"overlap": "[]"
},
{
"snap": 2,
"size": 452,
"overlap": "[]"
},
{
"snap": 3,
"size": 452,
"overlap": "[]"
},
{
"snap": 4,
"size": 452,
"overlap": "[]"
}
]
}

On 9/3/15 2:44 PM, David Zafman wrote:


Chris,

WARNING: Do this at your own risk.  You are deleting one of the 
snapshots of a specific portion of an rbd image.  I'm not sure how rbd 
will react.  Maybe you should repair the SnapSet instead of removing the 
inconsistency.   However, as far as I know there isn't a tool for it.


If you are able to build from Ceph source, I happen to have an 
enhancement to ceph-objectstore-tool to output the SnapSet.


---

The message preceding the assert is in the same thread so " 
rb.0.8c2990.238e1f29.8cc0/23ed//3" has the object name in it.  
The 23ed is the RADOS clone/snap ID.


First, get a backup by exporting the pg using the ceph-objectstore-tool.  
Specify a --file somewhere with enough disk space.


$ ceph-objectstore-tool --data-path xx --journal-path xx --op 
export --pgid 3.f9 --file destination

Exporting 3.f9

Read 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

Export successful

Now you need the JSON of the object in question.  The 3rd line of 
output has the snapid 9197, which is 23ed in hex.


$ ceph-objectstore-tool --data-path xx --journal-path xx --op 
list rb.0.8c2990.238e1f29.8cc0



["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9196,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid",9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9198,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 



To remove it, cut and paste your output line with snapid 9197 inside 
single quotes like this:


$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]' 
remove



remove 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

To get all the OSDs to boot you'll have to do the remove on all OSDs 
that contain this PG and have an entry with snapid 9197 for this object.


David

On 9/3/15 11:29 AM, Chris Taylor wrote:

On 09/03/2015 10:20 AM, David Zafman wrote:


This crash is what happens if a clone is missing from SnapSet 
(internal data) for an object in the ObjectStore.  If you had out of 
space issues, this could possibly have been caused by being able to 
rename or create files in a directory, but not being able to update 
SnapSet.


I've completely rewritten that logic so scrub doesn't crash, but it 
hasn't been in a release yet.  In the future scrub will just report 
an unexpected clone in the ObjectStore.


You'll need to fin

Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2

2015-08-28 Thread David Zafman


Without my latest branch, which hasn't merged yet, you can't repair an EC 
pg in the situation where the shard with a bad checksum is in the first k 
chunks.


A way to fix it would be to take that osd down/out and let recovery 
regenerate the chunk.  Remove the pg from the osd 
(ceph-objectstore-tool) and then you can bring the osd back up/in.


David

On 8/28/15 11:06 AM, Samuel Just wrote:

David, does this look familiar?
-Sam

On Fri, Aug 28, 2015 at 10:43 AM, Aaron Ten Clay aaro...@aarontc.com wrote:

Hi Cephers,

I'm trying to resolve an inconsistent pg on an erasure-coded pool, running
Ceph 9.0.2. I can't seem to get Ceph to run a repair or even deep-scrub the
pg again. Here's the background, with my attempted resolution steps below.
Hopefully someone can steer me in the right direction. Thanks in advance!

Current state:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
pg 2.36 is active+clean+inconsistent, acting
[1,21,12,9,0,10,14,7,18,20,5,4,22,16]
1 scrub errors
noout flag(s) set

I started by looking at the log file for osd.1, where I found the cause of
the inconsistent report:

2015-08-24 00:43:10.391621 7f09fcff9700  0 log_channel(cluster) log [INF] :
2.36 deep-scrub starts
2015-08-24 01:54:59.933532 7f09fcff9700 -1 log_channel(cluster) log [ERR] :
2.36s0 shard 21(1): soid 576340b6/1005990.0199/head//2 candidate had
a read error
2015-08-24 02:34:41.380740 7f09fcff9700 -1 log_channel(cluster) log [ERR] :
2.36s0 deep-scrub 0 missing, 1 inconsistent objects
2015-08-24 02:34:41.380757 7f09fcff9700 -1 log_channel(cluster) log [ERR] :
2.36 deep-scrub 1 errors

I checked osd.21, where this report appears:

2015-08-24 01:54:56.477020 7f707cbd4700  0 osd.21 pg_epoch: 31958 pg[2.36s1(
v 31957'43013 (7132'39997,31957'43013] local-les=31951 n=34556 ec=136 les/c
31951/31954 31945/31945/31924) [1,21,12,9,0,10,14,7,18,20,5,4,22,16] r=1
lpr=31945 pi=1131-31944/7827 luod=0'0 crt=31957'43011 active] _scan_list
576340b6/1005990.0199/head//2 got incorrect hash on read

So, based upon the ceph documentation, I thought I could repair the pg by
executing ceph pg repair 2.36. When I run this, while watching the mon
log, I see the command dispatch:

2015-08-28 10:14:17.964017 mon.0 [INF] from='client.? 10.42.5.61:0/1002181'
entity='client.admin' cmd=[{prefix: pg repair, pgid: 2.36}]:
dispatch

But I never see a finish in the mon log, like most ceph commands return.
(Not sure if I should expect to see a finish, just noting it doesn't occur.)

Also, tailing the logs for any OSD in the acting set for pg 2.36, I never
see anything about a repair. The same case holds when I try ceph pg 2.36
deep-scrub - command dispatched, but none of the OSDs care. In the past on
other clusters, I've seen [INF] : pg.id repair starts messages in the OSD
log after executing ceph pg nn.yy repair.

Further confusing me, I do see osd.1 start and finish other pg deep-scrubs,
before and after executing ceph pg 2.36 deep-scrub.

I know EC pools are special in several ways, but nothing in the Ceph manual
seems to indicate I can't deep-scrub or repair pgs in an EC pool...

Thanks for reading and any suggestions. I'm happy to provide complete log
files or more details if I've left out any information that could be
helpful.

ceph -s: http://hastebin.com/xetohugibi
ceph pg dump: http://hastebin.com/bijehoheve
ceph -v: ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
ceph osd dump: http://hastebin.com/fitajuzeca

-Aaron


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2

2015-08-28 Thread David Zafman


On 8/28/15 4:18 PM, Aaron Ten Clay wrote:

How would I go about removing the bad PG with ceph-objectstore-tool? I'm
having trouble finding any documentation for said tool.


ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --journal-path 
/var/lib/ceph/osd/ceph-0/journal --pgid 2.36s1 --op remove


Is it safe to just move /var/lib/ceph/osd/ceph-21/current/2.36s1_head to
another place and start the OSD process again?


yes


Can I safely tar the directory with tar -cvp --xattrs -f
/opt/osd-21-pg-2.36s1-removed_2015-08-28.tar
/var/lib/ceph/osd/ceph-21/current/2.36s1_*, then rm -rf
/var/lib/ceph/osd/ceph-21/current/2.36s1_*?


Better to use the --op export feature if you want to save the pg state:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X --journal-path 
/var/lib/ceph/osd/ceph-X/journal --pgid 2.36s1 --op export --file 
save2.36.s1.export


Just want to make sure I don't do something silly and shoot myself in the
foot.

Thanks!
-Aaron

On Fri, Aug 28, 2015 at 12:16 PM, David Zafman dzaf...@redhat.com wrote:


I don't know about removing the OSD from the CRUSH map.  That seems like
overkill to me.

I just realized a possible better way.  It would have been to take the OSD
down, not out.  Remove the EC pg with the bad chunk.  Bring it up again and
let recovery repair just the single missing PG on the single OSD with no
other disruption.

David


On 8/28/15 11:28 AM, Aaron Ten Clay wrote:


Thanks for the tip, David. I've marked osd.21 down and out and will wait
for recovery. I've never had success manually manipulating the OSD
contents
- I assume I can achieve the same result by removing osd.21 from the CRUSH
map, ceph osd rm 21, then recreating it from scratch as though I'd lost
a
disk?

-Aaron

On Fri, Aug 28, 2015 at 11:17 AM, David Zafman dzaf...@redhat.com
wrote:

Without my latest branch which hasn't merged yet, you can't repair an EC

pg in the situation that the shard with a bad checksum is in the first k
chunks.

A way to fix it would be to take that osd down/out and let recovery
regenerate the chunk.  Remove the pg from the osd (ceph-objectstore-tool)
and then you can bring the osd back up/in.

David


On 8/28/15 11:06 AM, Samuel Just wrote:

David, does this look familiar?

-Sam

On Fri, Aug 28, 2015 at 10:43 AM, Aaron Ten Clay aaro...@aarontc.com
wrote:

Hi Cephers,

I'm trying to resolve an inconsistent pg on an erasure-coded pool,
running
Ceph 9.0.2. I can't seem to get Ceph to run a repair or even deep-scrub
the
pg again. Here's the background, with my attempted resolution steps
below.
Hopefully someone can steer me in the right direction. Thanks in
advance!

Current state:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
pg 2.36 is active+clean+inconsistent, acting
[1,21,12,9,0,10,14,7,18,20,5,4,22,16]
1 scrub errors
noout flag(s) set

I started by looking at the log file for osd.1, where I found the cause
of
the inconsistent report:

2015-08-24 00:43:10.391621 7f09fcff9700  0 log_channel(cluster) log
[INF] :
2.36 deep-scrub starts
2015-08-24 01:54:59.933532 7f09fcff9700 -1 log_channel(cluster) log
[ERR] :
2.36s0 shard 21(1): soid 576340b6/1005990.0199/head//2
candidate
had
a read error
2015-08-24 02:34:41.380740 7f09fcff9700 -1 log_channel(cluster) log
[ERR] :
2.36s0 deep-scrub 0 missing, 1 inconsistent objects
2015-08-24 02:34:41.380757 7f09fcff9700 -1 log_channel(cluster) log
[ERR] :
2.36 deep-scrub 1 errors

I checked osd.21, where this report appears:

2015-08-24 01:54:56.477020 7f707cbd4700  0 osd.21 pg_epoch: 31958
pg[2.36s1(
v 31957'43013 (7132'39997,31957'43013] local-les=31951 n=34556 ec=136
les/c
31951/31954 31945/31945/31924) [1,21,12,9,0,10,14,7,18,20,5,4,22,16]
r=1
lpr=31945 pi=1131-31944/7827 luod=0'0 crt=31957'43011 active]
_scan_list
576340b6/1005990.0199/head//2 got incorrect hash on read

So, based upon the ceph documentation, I thought I could repair the pg
by
executing ceph pg repair 2.36. When I run this, while watching the
mon
log, I see the command dispatch:

2015-08-28 10:14:17.964017 mon.0 [INF] from='client.?
10.42.5.61:0/1002181'
entity='client.admin' cmd=[{prefix: pg repair, pgid: 2.36}]:
dispatch

But I never see a finish in the mon log, like most ceph commands
return.
(Not sure if I should expect to see a finish, just noting it doesn't
occur.)

Also, tailing the logs for any OSD in the acting set for pg 2.36, I
never
see anything about a repair. The same case holds when I try ceph pg
2.36
deep-scrub - command dispatched, but none of the OSDs care. In the
past
on
other clusters, I've seen [INF] : pg.id repair starts messages in
the
OSD
log after executing ceph pg nn.yy repair.

Further confusing me, I do see osd.1 start and finish other pg
deep-scrubs,
before and after executing ceph pg 2.36 deep-scrub.

I know EC pools are special in several ways, but nothing in the Ceph
manual
seems to indicate I can't deep-scrub or repair pgs in an EC pool...

Thanks for reading and any suggestions. I'm happy

Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2

2015-08-28 Thread David Zafman


I don't know about removing the OSD from the CRUSH map.  That seems like 
overkill to me.


I just realized a possible better way.  It would have been to take the OSD 
down, not out.  Remove the EC PG with the bad chunk.  Bring it up again 
and let recovery repair just the single missing PG on that one OSD 
with no other disruption.
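
As a rough illustration of those steps, assuming the bad shard is 2.36s1 on 
osd.21 and a default filestore layout (IDs, paths, journal device and service 
commands below are illustrative, not taken from the thread):

# ceph osd set noout
# stop ceph-osd id=21            (or: systemctl stop ceph-osd@21)
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
      --journal-path /dev/sdX1 --op remove --pgid 2.36s1
# start ceph-osd id=21
# ceph osd unset noout

Recovery should then regenerate just that one PG shard on osd.21.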


David

On 8/28/15 11:28 AM, Aaron Ten Clay wrote:

Thanks for the tip, David. I've marked osd.21 down and out and will wait
for recovery. I've never had success manually manipulating the OSD contents
- I assume I can achieve the same result by removing osd.21 from the CRUSH
map, ceph osd rm 21, then recreating it from scratch as though I'd lost a
disk?

-Aaron

On Fri, Aug 28, 2015 at 11:17 AM, David Zafman dzaf...@redhat.com wrote:


Without my latest branch which hasn't merged yet, you can't repair an EC
pg in the situation that the shard with a bad checksum is in the first k
chunks.

A way to fix it would be to take that osd down/out and let recovery
regenerate the chunk.  Remove the pg from the osd (ceph-objectstore-tool)
and then you can bring the osd back up/in.

David


On 8/28/15 11:06 AM, Samuel Just wrote:


David, does this look familiar?
-Sam

On Fri, Aug 28, 2015 at 10:43 AM, Aaron Ten Clay aaro...@aarontc.com
wrote:


Hi Cephers,

I'm trying to resolve an inconsistent pg on an erasure-coded pool,
running
Ceph 9.0.2. I can't seem to get Ceph to run a repair or even deep-scrub
the
pg again. Here's the background, with my attempted resolution steps
below.
Hopefully someone can steer me in the right direction. Thanks in advance!

Current state:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
pg 2.36 is active+clean+inconsistent, acting
[1,21,12,9,0,10,14,7,18,20,5,4,22,16]
1 scrub errors
noout flag(s) set

I started by looking at the log file for osd.1, where I found the cause
of
the inconsistent report:

2015-08-24 00:43:10.391621 7f09fcff9700  0 log_channel(cluster) log
[INF] :
2.36 deep-scrub starts
2015-08-24 01:54:59.933532 7f09fcff9700 -1 log_channel(cluster) log
[ERR] :
2.36s0 shard 21(1): soid 576340b6/1005990.0199/head//2 candidate
had
a read error
2015-08-24 02:34:41.380740 7f09fcff9700 -1 log_channel(cluster) log
[ERR] :
2.36s0 deep-scrub 0 missing, 1 inconsistent objects
2015-08-24 02:34:41.380757 7f09fcff9700 -1 log_channel(cluster) log
[ERR] :
2.36 deep-scrub 1 errors

I checked osd.21, where this report appears:

2015-08-24 01:54:56.477020 7f707cbd4700  0 osd.21 pg_epoch: 31958
pg[2.36s1(
v 31957'43013 (7132'39997,31957'43013] local-les=31951 n=34556 ec=136
les/c
31951/31954 31945/31945/31924) [1,21,12,9,0,10,14,7,18,20,5,4,22,16] r=1
lpr=31945 pi=1131-31944/7827 luod=0'0 crt=31957'43011 active] _scan_list
576340b6/1005990.0199/head//2 got incorrect hash on read

So, based upon the ceph documentation, I thought I could repair the pg by
executing ceph pg repair 2.36. When I run this, while watching the mon
log, I see the command dispatch:

2015-08-28 10:14:17.964017 mon.0 [INF] from='client.?
10.42.5.61:0/1002181'
entity='client.admin' cmd=[{prefix: pg repair, pgid: 2.36}]:
dispatch

But I never see a finish in the mon log, like most ceph commands
return.
(Not sure if I should expect to see a finish, just noting it doesn't
occur.)

Also, tailing the logs for any OSD in the acting set for pg 2.36, I never
see anything about a repair. The same case holds when I try ceph pg 2.36
deep-scrub - command dispatched, but none of the OSDs care. In the past
on
other clusters, I've seen [INF] : pg.id repair starts messages in the
OSD
log after executing ceph pg nn.yy repair.

Further confusing me, I do see osd.1 start and finish other pg
deep-scrubs,
before and after executing ceph pg 2.36 deep-scrub.

I know EC pools are special in several ways, but nothing in the Ceph
manual
seems to indicate I can't deep-scrub or repair pgs in an EC pool...

Thanks for reading and any suggestions. I'm happy to provide complete log
files or more details if I've left out any information that could be
helpful.

ceph -s: http://hastebin.com/xetohugibi
ceph pg dump: http://hastebin.com/bijehoheve
ceph -v: ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
ceph osd dump: http://hastebin.com/fitajuzeca

-Aaron


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Giant not fixed RepllicatedPG:NotStrimming?

2014-11-04 Thread David Zafman

Can you upload the entire log file?

David

 On Nov 4, 2014, at 1:03 AM, Ta Ba Tuan tua...@vccloud.vn wrote:
 
 Hi Sam,
 I resend logs with debug options: http://123.30.41.138/ceph-osd.21.log 
 (Sorry about my spam :D)
 
 I saw many missing objects :|
 
 2014-11-04 15:26:02.205607 7f3ab11a8700 10 osd.21 pg_epoch: 106407 pg[24.7d7( 
 v 106407'491583 lc 106401'491579 (105805'487042,106407'491583] loca
 l-les=106403 n=179 ec=25000 les/c 106403/106390 106402/106402/106402) 
 [21,28,4] r=0 lpr=106402 pi=106377-106401/4 rops=1 crt=106401'491581 mlcod 
 106393'491097 active+recovering+degraded m=2 snaptrimq=[306~1,312~1]] 
 recover_primary 675ea7d7/rbd_data.4930222ae8944a.0001/head//24 
 106401'491580 (missing) (missing head) (recovering) (recovering head)
 2014-11-04 15:26:02.205642 7f3ab11a8700 10 osd.21 pg_epoch: 106407 pg[24.7d7( 
 v 106407'491583 lc 106401'491579 (105805'487042,106407'491583] 
 local-les=106403 n=179 ec=25000 les/c 106403/106390 106402/106402/106402) 
 [21,28,4] r=0 lpr=106402 pi=106377-106401/4 rops=1 crt=106401'491581 mlcod 
 106393'491097 active+recovering+degraded m=2 snaptrimq=[306~1,312~1]] 
 recover_primary 
 d4d4bfd7/rbd_data.c6964d30a28220.035f/head//24 106401'491581 
 (missing) (missing head)
 2014-11-04 15:26:02.237994 7f3ab29ab700 10 osd.21 pg_epoch: 106407 pg[24.7d7( 
 v 106407'491583 lc 106401'491579 (105805'487042,106407'491583] 
 local-les=106403 n=179 ec=25000 les/c 106403/106390 106402/106402/106402) 
 [21,28,4] r=0 lpr=106402 pi=106377-106401/4 rops=2 crt=106401'491581 mlcod 
 106393'491097 active+recovering+degraded m=2 snaptrimq=[306~1,312~1]] got 
 missing d4d4bfd7/rbd_data.c6964d30a28220.035f/head//24 v 
 106401'491581
 
 Thanks Sam and All,
 --
 Tuan
 HaNoi-Vietnam
 
 On 11/04/2014 04:54 AM, Samuel Just wrote:
 Can you reproduce with
 
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 
 In the [osd] section of that osd's ceph.conf?
 -Sam
 
 On Sun, Nov 2, 2014 at 9:10 PM, Ta Ba Tuan tua...@vccloud.vn 
 mailto:tua...@vccloud.vn wrote:
 Hi Sage, Samuel  All,
 
 I upgraded to GAINT, but still appearing that errors |:
 I'm trying on deleting  related objects/volumes, but very hard to verify
 missing objects :(.
 
 Guide me to resolve it, please! (I send attached detail log).
 
 2014-11-03 11:37:57.730820 7f28fb812700  0 osd.21 105950 do_command r=0
 2014-11-03 11:37:57.856578 7f28fc013700 -1 *** Caught signal (Segmentation
 fault) **
  in thread 7f28fc013700
 
  ceph version 0.87-6-gdba7def (dba7defc623474ad17263c9fccfec60fe7a439f0)
  1: /usr/bin/ceph-osd() [0x9b6725]
  2: (()+0xfcb0) [0x7f291fc2acb0]
  3: (ReplicatedPG::trim_object(hobject_t const)+0x395) [0x811b55]
  4: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim
 const)+0x43e) [0x82b9be]
  5: (boost::statechart::simple_stateReplicatedPG::TrimmingObjects,
 ReplicatedPG::SnapTrimmer, boost::mpl::listmpl_::na, mpl_::na, mpl_::na,
 mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
 mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
 mpl_::na, mpl_::na, mpl_::na,
 (boost::statechart::history_mode)0::react_impl(boost::statechart::event_base
 const, void const*)+0xc0) [0x870ce0]
  6: (boost::statechart::state_machineReplicatedPG::SnapTrimmer,
 ReplicatedPG::NotTrimming, std::allocatorvoid,
 boost::statechart::null_exception_translator::process_queued_events()+0xfb)
 [0x85618b]
  7: (boost::statechart::state_machineReplicatedPG::SnapTrimmer,
 ReplicatedPG::NotTrimming, std::allocatorvoid,
 boost::statechart::null_exception_translator::process_event(boost::statechart::event_base
 const)+0x1e) [0x85633e]
  8: (ReplicatedPG::snap_trimmer()+0x4f8) [0x7d5ef8]
  9: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x673ab4]
  10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xa8fade]
  11: (ThreadPool::WorkThread::entry()+0x10) [0xa92870]
  12: (()+0x7e9a) [0x7f291fc22e9a]
  13: (clone()+0x6d) [0x7f291e5ed31d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
 interpret this.
 
  -9993 2014-11-03 11:37:47.689335 7f28fc814700  1 -- 172.30.5.2:6803/7606
 -- 172.30.5.1:6886/3511 -- MOSDPGPull(6.58e 105950
 [PullOp(87f82d8e/rbd_data.45e62779c99cf1.22b5/head//6,
 recovery_info:
 ObjectRecoveryInfo(87f82d8e/rbd_data.45e62779c99cf1.22b5/head//6@105938'11622009,
 copy_subset: [0~18446744073709551615], clone_subset: {}), recovery_progress:
 ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false,
 omap_recovered_to:, omap_complete:false))]) v2 -- ?+0 0x26c59000 con
 0x22fbc420
 
 -2 2014-11-03 11:37:57.853585 7f2902820700  5 osd.21 pg_epoch: 105950
 pg[24.9e4( v 105946'113392 lc 105946'113391 (103622'109598,105946'113392]
 local-les=1
 05948 n=88 ec=25000 les/c 105948/105943 105947/105947/105947) [21,112,33]
 r=0 lpr=105947 pi=105933-105946/4 crt=105946'113392 lcod 0'0 mlcod 0'0
 active+recovery
 _wait+degraded m=1 

Re: [ceph-users] Performance is really bad when I run from vstart.sh

2014-07-02 Thread David Zafman

By default the vstart.sh setup would put all data below a directory called 
“dev” in the source tree.  In that case you’re using a single spindle.  The 
vstart script isn’t intended for performance testing.
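
If the goal is to measure the hardware rather than a debug build, a 
package-based install (or at least OSD data directories on separate disks) 
plus rados bench is a more meaningful test; for example, against an existing 
pool (pool name and duration are only examples):

$ rados bench -p rbd 60 write --no-cleanup
$ rados bench -p rbd 60 seq

--no-cleanup keeps the written objects around so the sequential-read pass has 
something to read.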

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jul 2, 2014, at 5:48 PM, Zhe Zhang zhe_zh...@symantec.com wrote:

 Hi folks,
  
 I run ceph on a single node which contains 25 hard drives and each @7200 RPM. 
 I write raw data into the array, it achieved 2 GB/s. I presumed the 
 performance of ceph could go beyond 1 GB/s. but when I compile and ceph code 
 and run development mode with vstart.sh, the average throughput is only 200 
 MB/s for rados bench write.
 I suspected it was due to the debug mode when I configure the source code, 
 and I disable the gdb with ./configure CFLAGS=’-O3’ CXXFLAGS=’O3’ (avoid ‘–g’ 
 flag). But it did not help at all.
 I switched to the repository, and install ceph with ceph-deploy, the 
 performance achieved 800 MB/s. Since I did not successfully set up the ceph 
 with ceph-deploy, and there are still some pg at “creating+incomplete” state, 
 I guess this could impact the performance.
 Anyway, could someone give me some suggestions? Why it is so slow when I run 
 from vstart.sh?
  
 Best,
 Zhe
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubles with a fireflay test installation

2014-06-25 Thread David Zafman

Create a 3rd OSD.  The default pool size is 3 replicas including the initial 
system created pools.
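
If adding a third OSD isn't convenient for a small test like this, the other 
way out is to drop the replication size of the three pre-created pools to 
match the two OSDs that exist; a minimal sketch:

# ceph osd pool set data size 2
# ceph osd pool set metadata size 2
# ceph osd pool set rbd size 2

Note that "osd pool default size = 2" only applies to pools created after the 
setting takes effect, and it needs to sit in a section the daemons actually 
read (e.g. [global]) rather than [default].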

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jun 25, 2014, at 3:04 AM, Iban Cabrillo cabri...@ifca.unican.es wrote:

 Dear,
   I am trying to deploy a new test following the instructions. for the latest 
 firefly version under yum repo.
 
   Installing : ceph-libs-0.80.1-2.el6.x86_64
   Installing : ceph-0.80.1-2.el6.x86_64   
 
   The initial setups contains 3 mon and little osds (1GB per journal)
 
   The cluster has been created correctly.
   The OSD's too with a little trouble (I had to make by hand the osd dir 
 under /var/lib/ceph/, mayby this is a bug) :
   
 [ceph02][WARNIN] ceph-disk: Error: unable to create symlink 
 /var/lib/ceph/osd/ceph-0 -> /data/osd0
 [ceph03][WARNIN] ceph-disk: Error: unable to create symlink 
 /var/lib/ceph/osd/ceph-1 -> /data/osd1
 
 Now the status show:
 
   [ceph@cephadm ceph-cloud]$ sudo ceph status
 cluster 344c60e2-cef8-41f3-92ae-1995b0abc870
  health HEALTH_WARN 192 pgs incomplete; 192 pgs stuck inactive; 192 pgs 
 stuck unclean
  monmap e2: 3 mons at 
 {ceph02=10.10.3.2:6789/0,ceph03=10.10.3.3:6789/0,cephadm=10.10.3.1:6789/0}, 
 election epoch 6, quorum 0,1,2 cephadm,ceph02,ceph03
  osdmap e8: 2 osds: 2 up, 2 in
   pgmap v15: 192 pgs, 3 pools, 0 bytes data, 0 objects
 2120 MB used, 1705 MB / 4030 MB avail
  192 incomplete
 
 [ceph@cephadm ceph-cloud]$ ceph osd tree
 # id  weight  type name   up/down reweight
 -10   root default
 -20   host ceph02
 0 0   osd.0   up  1   
 -30   host ceph03
 1 0   osd.1   up  1   
 
 [ceph@cephadm ceph-cloud]$ ceph osd dump
 epoch 8
 fsid 344c60e2-cef8-41f3-92ae-1995b0abc870
 created 2014-06-25 11:03:51.830572
 modified 2014-06-25 11:11:57.789185
 flags 
 pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
 rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool 
 crash_replay_interval 45 stripe_width 0
 pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
 rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool 
 stripe_width 0
 pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
 rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 flags hashpspool 
 stripe_width 0
 max_osd 2
 osd.0 up   in  weight 1 up_from 4 up_thru 5 down_at 0 last_clean_interval 
 [0,0) 10.10.3.2:6800/16295 10.10.3.2:6801/16295 10.10.3.2:6802/16295 
 10.10.3.2:6803/16295 exists,up ef41c1d6-4510-44fd-af48-81986c0f6a1e
 osd.1 up   in  weight 1 up_from 8 up_thru 0 down_at 0 last_clean_interval 
 [0,0) 10.10.3.3:6800/14463 10.10.3.3:6801/14463 10.10.3.3:6802/14463 
 10.10.3.3:6803/14463 exists,up 6f2b4d9f-8b7
 
 
 [global]
 auth_service_required = cephx
 filestore_xattr_use_omap = true
 auth_client_required = cephx
 auth_cluster_required = cephx
 mon_host = 10.10.3.1,10.10.3.2,10.10.3.3
 mon_initial_members = cephadm, ceph02, ceph03
 fsid = 344c60e2-cef8-41f3-92ae-1995b0abc870
 
 [default]
 osd pool default size = 2
 
 [osd]
 osd journal size = 1024
 
 Any Idea why there is no  active + clean state?
 
 I have tried to create a storage pool and mount from ceph client, but all the 
 rbd commands hang forever...
 
 regards, I
 Bertrand Russell:
 The problem with the world is that the stupid are sure of everything and the 
 intelligent are full of doubt
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep scrub versus osd scrub load threshold

2014-06-24 Thread David Zafman

Unfortunately, decreasing the osd_scrub_max_interval to 6 days isn’t going to 
fix it.

There is sort of a quirk in the way the deep scrub is initiated.  It doesn’t 
trigger a deep scrub until a regular scrub is about to start.  So with 
osd_scrub_max_interval set to 1 week and a high load the next possible scrub or 
deep-scrub is 1 week from the last REGULAR scrub, even if the last deep scrub 
was more than 7 days ago.  

The longest wait for a deep scrub is osd_scrub_max_interval + 
osd_deep_scrub_interval between deep scrubs.

For example, a deep scrub happens on Jan 1.  Each day after that for six days a 
regular scrub happens with low load.  After 6 regular scrubs ending on Jan 7 
the load goes high.  Now with the load high no scrub can start until Jan 14 
because you must get past osd_scrub_max_interval since the last regular scrub 
on Jan 7.  At that time it will be a deep scrub because it is more than 7 days 
since the last deep scrub on Jan 1.

See also http://tracker.ceph.com/issues/6735

There may be a need for more documentation clarification in this area or a 
change to the behavior.
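
As a rough illustration of the earlier suggestion to keep the three intervals 
equal (the values are in seconds and purely an example; injectargs changes are 
not persistent, so mirror whatever you settle on in ceph.conf):

# ceph tell osd.* injectargs '--osd_scrub_min_interval 604800 --osd_scrub_max_interval 604800 --osd_deep_scrub_interval 604800'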

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jun 23, 2014, at 11:10 PM, Christian Balzer ch...@gol.com wrote:

 
 
 Hello,
 
 On Mon, 23 Jun 2014 21:50:50 -0700 David Zafman wrote:
 
 
 By default osd_scrub_max_interval and osd_deep_scrub_interval are 1 week
 604800 seconds (60*60*24*7) and osd_scrub_min_interval is 1 day 86400
 seconds (60*60*24).  As long as osd_scrub_max_interval <=
 osd_deep_scrub_interval then the load won’t impact when deep scrub
 occurs.   I suggest that osd_scrub_min_interval =
 osd_scrub_max_interval = osd_deep_scrub_interval.
 
 I’d like to know how you have those 3 values set, so I can confirm that
 this explains the issue.
 
 They are and were unsurprisingly set to the default values.
 
 Now to provide some more information, shortly after the inception of this
 cluster I did initiate a deep scrub on all OSDs on 00:30 on a Sunday
 morning (the things we do for Ceph, a scheduler with a variety of rules
 would be nice, but I digress). 
 This took until 05:30 despite the cluster being idle and with close to no
 data in it. In retrospect it seems clear to me that this already was
 influenced by the load threshold (a scrub I initiated with the new
 threshold value of 1.5 finished in just 30 minutes last night).
 Consequently all the normal scrubs happened in the same time frame until
 this weekend on the 21st (normal scrub).
 The deep scrub on the 22nd clearly ran into the load threshold.
 
 So if I understand you correctly setting osd_scrub_max_interval to 6 days
 should have deep scrubs ignore the load threshold as per the documentation?
 
 Regards,
 
 Christian
 
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 http://www.redhat.com
 
 On Jun 23, 2014, at 7:01 PM, Christian Balzer ch...@gol.com wrote:
 
 
 Hello,
 
 On Mon, 23 Jun 2014 14:20:37 -0400 Gregory Farnum wrote:
 
 Looks like it's a doc error (at least on master), but it might have
 changed over time. If you're running Dumpling we should change the
 docs.
 
 Nope, I'm running 0.80.1 currently.
 
 Christian
 
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Sun, Jun 22, 2014 at 10:18 PM, Christian Balzer ch...@gol.com
 wrote:
 
 Hello,
 
 This weekend I noticed that the deep scrubbing took a lot longer than
 usual (long periods without a scrub running/finishing), even though
 the cluster wasn't all that busy.
 It was however busier than in the past and the load average was above
 0.5 frequently.
 
 Now according to the documentation osd scrub load threshold is
 ignored when it comes to deep scrubs.
 
 However after setting it to 1.5 and restarting the OSDs the
 floodgates opened and all those deep scrubs are now running at full
 speed.
 
 Documentation error or did I unstuck something by the OSD restart?
 
 Regards,
 
 Christian
 --
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com   Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com   Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com Global OnLine Japan/Fusion Communications
 http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep scrub versus osd scrub load threshold

2014-06-23 Thread David Zafman

By default osd_scrub_max_interval and osd_deep_scrub_interval are 1 week 604800 
seconds (60*60*24*7) and osd_scrub_min_interval is 1 day 86400 seconds 
(60*60*24).  As long as osd_scrub_max_interval <= osd_deep_scrub_interval then 
the load won’t impact when deep scrub occurs.   I suggest that 
osd_scrub_min_interval = osd_scrub_max_interval = osd_deep_scrub_interval.

I’d like to know how you have those 3 values set, so I can confirm that this 
explains the issue.
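
One way to read those three values off a running OSD, assuming the default 
admin socket path:

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'osd_scrub_min_interval|osd_scrub_max_interval|osd_deep_scrub_interval'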


David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jun 23, 2014, at 7:01 PM, Christian Balzer ch...@gol.com wrote:

 
 Hello,
 
 On Mon, 23 Jun 2014 14:20:37 -0400 Gregory Farnum wrote:
 
 Looks like it's a doc error (at least on master), but it might have
 changed over time. If you're running Dumpling we should change the
 docs.
 
 Nope, I'm running 0.80.1 currently.
 
 Christian
 
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Sun, Jun 22, 2014 at 10:18 PM, Christian Balzer ch...@gol.com wrote:
 
 Hello,
 
 This weekend I noticed that the deep scrubbing took a lot longer than
 usual (long periods without a scrub running/finishing), even though the
 cluster wasn't all that busy.
 It was however busier than in the past and the load average was above
 0.5 frequently.
 
 Now according to the documentation osd scrub load threshold is
 ignored when it comes to deep scrubs.
 
 However after setting it to 1.5 and restarting the OSDs the floodgates
 opened and all those deep scrubs are now running at full speed.
 
 Documentation error or did I unstuck something by the OSD restart?
 
 Regards,
 
 Christian
 --
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com   Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What exactly is the kernel rbd on osd issue?

2014-06-12 Thread David Zafman

This was commented on recently on ceph-users, but I’ll explain the scenario.

If the single kernel needs to flush rbd blocks to reclaim memory and the OSD 
process needs memory to handle the flushes, you end up deadlocked.

If you run the rbd client in a VM with dedicated memory allocation from the 
point of view of the host kernel, this won’t happen.

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jun 12, 2014, at 6:33 PM, lists+c...@deksai.com wrote:

 I remember reading somewhere that the kernel ceph clients (rbd/fs) could
 not run on the same host as the OSD.  I tried finding where I saw that,
 and could only come up with some irc chat logs.
 
 The issue stated there is that there can be some kind of deadlock.  Is
 this true, and if so, would you have to run a totally different kernel
 in a vm, or would some form of namespacing be enough to avoid it?
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG Selection Criteria for Deep-Scrub

2014-06-11 Thread David Zafman

The code checks the pg with the oldest scrub_stamp/deep_scrub_stamp to see 
whether the osd_scrub_min_interval/osd_deep_scrub_interval time has elapsed.  
So the output you are showing with the very old scrub stamps shouldn’t happen 
under default settings.  As soon as deep-scrub is re-enabled, the 5 pgs with 
that old stamp should be the first to get run.

A PG needs to have active and clean set to be scrubbed.   If any weren’t 
active+clean, then even a manual scrub would do nothing.

Now that I’m looking at the code I see that your symptom is possible if the 
values of osd_scrub_min_interval or osd_scrub_max_interval are larger than your 
osd_deep_scrub_interval.  Should the osd_scrub_min_interval be greater than 
osd_deep_scrub_interval, there won't be a deep scrub until the 
osd_scrub_min_interval has elapsed.  If an OSD is under load and the 
osd_scrub_max_interval is greater than the osd_deep_scrub_interval, there won't 
be a deep scrub until osd_scrub_max_interval has elapsed.

Please check the 3 interval config values.  Verify that your PGs are 
active+clean just to be sure.
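
For example (the awk column for the deep-scrub stamp matches the pg dump 
layout used elsewhere in this thread and may differ on other releases; the 
admin socket path assumes the default layout):

# ceph pg dump_stuck unclean
# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'scrub_(min|max)_interval|deep_scrub_interval'
# ceph pg dump all | grep active | awk '{print $1, $20}' | sort -k2 | head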

David


On May 20, 2014, at 5:21 PM, Mike Dawson mike.daw...@cloudapt.com wrote:

 Today I noticed that deep-scrub is consistently missing some of my Placement 
 Groups, leaving me with the following distribution of PGs and the last day 
 they were successfully deep-scrubbed.
 
 # ceph pg dump all | grep active | awk '{ print $20}' | sort -k1 | uniq -c
  5 2013-11-06
221 2013-11-20
  1 2014-02-17
 25 2014-02-19
 60 2014-02-20
  4 2014-03-06
  3 2014-04-03
  6 2014-04-04
  6 2014-04-05
 13 2014-04-06
  4 2014-04-08
  3 2014-04-10
  2 2014-04-11
 50 2014-04-12
 28 2014-04-13
 14 2014-04-14
  3 2014-04-15
 78 2014-04-16
 44 2014-04-17
  8 2014-04-18
  1 2014-04-20
 16 2014-05-02
 69 2014-05-04
140 2014-05-05
569 2014-05-06
   9231 2014-05-07
103 2014-05-08
514 2014-05-09
   1593 2014-05-10
393 2014-05-16
   2563 2014-05-17
   1283 2014-05-18
   1640 2014-05-19
   1979 2014-05-20
 
 I have been running the default osd deep scrub interval of once per week, 
 but have disabled deep-scrub on several occasions in an attempt to avoid the 
 associated degraded cluster performance I have written about before.
 
 To get the PGs longest in need of a deep-scrub started, I set the 
 nodeep-scrub flag, and wrote a script to manually kick off deep-scrub 
 according to age. It is processing as expected.
 
 Do you consider this a feature request or a bug? Perhaps the code that 
 schedules PGs to deep-scrub could be improved to prioritize PGs that have 
 needed a deep-scrub the longest.
 
 Thanks,
 Mike Dawson
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with ceph_filestore_dump, possibly stuck in a loop

2014-05-20 Thread David Zafman

It isn’t clear to me what could cause a loop there.  Just to be sure you don’t 
have a filesystem corruption please try to run a “find” or “ls -R” on the 
filestore root directory to be sure it completes.
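
For example, something along these lines against the filestore root (default 
path layout; substitute the actual OSD id), which should finish with a plain 
count if the filesystem is sound:

# find /var/lib/ceph/osd/ceph-<id>/current -xdev | wc -l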

Can you send the log you generated?  Also, what version of Ceph are you running?

David Zafman
Senior Developer
http://www.inktank.com

On May 16, 2014, at 6:20 AM, Jeff Bachtel jbach...@bericotechnologies.com 
wrote:

 Overnight, I tried to use ceph_filestore_dump to export a pg that is missing 
 from other osds from an osd, with the intent of manually copying the export 
 to the osds in the pg map and importing.
 
 Unfortunately, what is on-disk 59gb of data had filled 1TB when I got in this 
 morning, and still hadn't completed. Is it possible for a loop to develop in 
 a ceph_filestore_dump export?
 
 My C++ isn't the best. I can see in ceph_filestore_dump.cc int export_files a 
 loop could occur if a broken collection was read, possibly. Maybe.
 
 --debug output seems to confirm?
 
 grep '^read' /tmp/ceph_filestore_dump.out  | sort | wc -l ; grep '^read' 
 /tmp/ceph_filestore_dump.out  | sort | uniq | wc -l
 2714
 258
 
 (only 258 unique reads are being reported, but each repeated > 10 times so 
 far)
 
 From start of debug output
 
 Supported features: compat={},rocompat={},incompat={1=initial feature 
 set(~v.18),2=pginfo object,3=object 
 locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded
  objects}
 On-disk features: compat={},rocompat={},incompat={1=initial feature 
 set(~v.18),2=pginfo object,3=object 
 locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper}
 Exporting 0.2f
 read 8210002f/100d228.00019150/head//0
 size=4194304
 data section offset=1048576 len=1048576
 data section offset=2097152 len=1048576
 data section offset=3145728 len=1048576
 data section offset=4194304 len=1048576
 attrs size 2
 
 then at line 1810
 read 8210002f/100d228.00019150/head//0
 size=4194304
 data section offset=1048576 len=1048576
 data section offset=2097152 len=1048576
 data section offset=3145728 len=1048576
 data section offset=4194304 len=1048576
 attrs size 2
 
 
 If this is a loop due to a broken filestore, is there any recourse on 
 repairing it? The osd I'm trying to dump from isn't in the pg map for the 
 cluster, I'm trying to save some data by exporting this version of the pg and 
 importing it on an osd that's mapped. If I'm failing at a basic premise even 
 trying to do that, please let me know so I can wave off (in which case, I 
 believe I'd use ceph_filestore_dump to delete all copies of this pg in the 
 cluster so I can force create it, which is failing at this time).
 
 Thanks,
 
 Jeff
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_recovery_max_single_start

2014-04-28 Thread David Zafman


On Apr 24, 2014, at 10:09 AM, Chad Seys cws...@physics.wisc.edu wrote:

 Hi David,
  Thanks for the reply.
  I'm a little confused by OSD versus PGs in the description of the two 
 options osd_recovery_max_single_start and osd_recovery_max_active .

An OSD manages all the PGs in its object store (a subset of all PGs in the 
cluster).  An OSD only needs to manage recovery of the PGs for which it is 
primary and which need recovery.

 
 The ceph webpage describes osd_recovery_max_active as The number of active 
 recovery requests per OSD at one time. It does not mention PGs. ?
 
 Assuming you meant OSD instead of PG, is this a rephrase of your message:
 
 osd_recovery_max_active (default 15) recovery operations will run total and 
 will be started in groups of osd_recovery_max_single_start (default 5)”

Yes, but PGs are the units in which newly started recovery ops are grouped.

The osd_recovery_max_active is the number of recovery operations which can be 
active at any given time for an OSD for all the PGs it is simultaneously 
recovering.

The osd_recovery_max_single_start is the maximum number of recovery operations 
that will be newly started per PG that the OSD is recovering.

 
 So if I set osd_recovery_max_active = 1 then osd_recovery_max_single_start 
 will effectively = 1 ?


Yes, if osd_recovery_max_active <= osd_recovery_max_single_start then, when no 
ops are currently active, we could only start osd_recovery_max_active new 
ops anyway.
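
So, as a purely illustrative way to throttle recovery to one op at a time per 
OSD at runtime:

# ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_recovery_max_single_start 1'

(injectargs is not persistent; add the same options under [osd] in ceph.conf 
to keep them across restarts.)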

 
 Thanks!
 Chad.
 
 On Thursday, April 24, 2014 11:43:47 you wrote:
 The value of osd_recovery_max_single_start (default 5) is used in
 conjunction with osd_recovery_max_active (default 15).   This means that a
 given PG will start up to 5 recovery operations at time of a total of 15
 operations active at a time.  This allows recovery to spread operations
 across more or less PGs at any given time.
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 On Apr 24, 2014, at 8:09 AM, Chad Seys cws...@physics.wisc.edu wrote:
 Hi All,
 
  What does osd_recovery_max_single_start do?  I could not find a
  description
 
 of it.
 
 Thanks!
 Chad.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



David Zafman
Senior Developer
http://www.inktank.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_recovery_max_single_start

2014-04-24 Thread David Zafman

The value of osd_recovery_max_single_start (default 5) is used in conjunction 
with osd_recovery_max_active (default 15).   This means that a given PG will 
start up to 5 recovery operations at a time, out of a total of 15 operations active at 
a time.  This allows recovery to spread operations across more or less PGs at 
any given time.
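
To see what a particular OSD is currently running with (default admin socket 
path assumed):

# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_recovery_max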

David Zafman
Senior Developer
http://www.inktank.com




On Apr 24, 2014, at 8:09 AM, Chad Seys cws...@physics.wisc.edu wrote:

 Hi All,
   What does osd_recovery_max_single_start do?  I could not find a description 
 of it.
 
 Thanks!
 Chad.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent pgs after update to 0.73 - 0.74

2014-01-09 Thread David Zafman

With pool size of 1 the scrub can still do some consistency checking.  These 
are things like missing attributes, on-disk size doesn’t match attribute size, 
non-clone without a head, expected clone.  You could check the osd logs to see 
what they were.

The pg below only had 1 object in error and it was detected after 2013-12-13 
15:38:13.283741 which was the last clean scrub.
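
For example, grepping the primary's log for that PG (osd.0 is the sole acting 
OSD for pg 2.3f in the query below; the log path assumes the default layout):

# grep '2\.3f' /var/log/ceph/ceph-osd.0.log | grep -iE 'scrub|err'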

David Zafman
Senior Developer
http://www.inktank.com




On Jan 9, 2014, at 6:36 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

 I've noticed this on 2 (development) clusters that I have with pools having 
 size 1. I guess my first question would be - is this expected?
 
 Here's some detail from one of the clusters:
 
 $ ceph -v
 ceph version 0.74-621-g6fac2ac (6fac2acc5e6f77651ffcd7dc7aa833713517d8a6)
 
 $ ceph osd dump
 epoch 104
 fsid 4e8548e8-dfe4-46d0-a2e8-fb4a9beadff2
 created 2013-11-08 11:01:57.051773
 modified 2014-01-10 14:55:07.353514
 flags
 
 pool 0 'data' replicated size 1 min_size 1 crush_ruleset 0 object_hash 
 rjenkins pg_num 64 pgp_num 64 last_change 47 owner 0 crash_replay_interval 45
 pool 1 'metadata' replicated size 1 min_size 1 crush_ruleset 1 object_hash 
 rjenkins pg_num 64 pgp_num 64 last_change 49 owner 0
 pool 2 'rbd' replicated size 1 min_size 1 crush_ruleset 2 object_hash 
 rjenkins pg_num 64 pgp_num 64 last_change 51 owner 0
 
 max_osd 2
 osd.0 up   in  weight 1 up_from 102 up_thru 102 down_at 101 
 last_clean_interval [97,100) 192.168.2.63:6800/27931 192.168.2.63:6801/27931 
 192.168.2.63:6802/27931 192.168.2.63:6803/27931 exists,up 
 3c23570a-8a46-4ff2-9dab-cbbe82138bf7
 osd.1 up   in  weight 1 up_from 103 up_thru 103 down_at 100 
 last_clean_interval [98,99) 192.168.2.63:6805/28070 192.168.2.63:6806/28070 
 192.168.2.63:6807/28070 192.168.2.63:6808/28070 exists,up 
 73470b19-cd55-4881-a070-55efa06f3df3
 
 $ ceph -s
cluster 4e8548e8-dfe4-46d0-a2e8-fb4a9beadff2
 health HEALTH_ERR 62 pgs inconsistent; 62 scrub errors; crush map has 
 non-optimal tunables
 monmap e1: 1 mons at {vedavec=192.168.2.63:6789/0}, election epoch 1, 
 quorum 0 vedavec
 osdmap e104: 2 osds: 2 up, 2 in
  pgmap v4101: 192 pgs, 3 pools, 9691 MB data, 2439 objects
9833 MB used, 179 GB / 198 GB avail
 130 active+clean
  62 active+clean+inconsistent
 
 
 There are no dmesg errors, and performing a repair on each inconsistent pg 
 removes the inconsistent state. I note with this cluster above that I have 
 non optimal tunables, however (I think) I sorted that on the other one, which 
 had no effect on the inconsistent pgs.
 
 Here's a query from one inconsistent pg:
 
 $ ceph pg 2.3f query
 { state: active+clean+inconsistent,
  epoch: 104,
  up: [
0],
  acting: [
0],
  actingbackfill: [
0],
  info: { pgid: 2.3f,
  last_update: 57'195468,
  last_complete: 57'195468,
  log_tail: 57'192467,
  last_user_version: 195468,
  last_backfill: MAX,
  purged_snaps: [],
  history: { epoch_created: 1,
  last_epoch_started: 103,
  last_epoch_clean: 103,
  last_epoch_split: 0,
  same_up_since: 102,
  same_interval_since: 102,
  same_primary_since: 102,
  last_scrub: 57'195468,
  last_scrub_stamp: 2014-01-10 14:59:23.195403,
  last_deep_scrub: 57'195468,
  last_deep_scrub_stamp: 2014-01-08 11:48:47.559227,
  last_clean_scrub_stamp: 2013-12-13 15:38:13.283741},
  stats: { version: 57'195468,
  reported_seq: 531759,
  reported_epoch: 104,
  state: active+clean+inconsistent,
  last_fresh: 2014-01-10 14:59:23.195452,
  last_change: 2014-01-10 14:59:23.195452,
  last_active: 2014-01-10 14:59:23.195452,
  last_clean: 2014-01-10 14:59:23.195452,
  last_became_active: 0.00,
  last_unstale: 2014-01-10 14:59:23.195452,
  mapping_epoch: 101,
  log_start: 57'192467,
  ondisk_log_start: 57'192467,
  created: 1,
  last_epoch_clean: 103,
  parent: 0.0,
  parent_split_bits: 0,
  last_scrub: 57'195468,
  last_scrub_stamp: 2014-01-10 14:59:23.195403,
  last_deep_scrub: 57'195468,
  last_deep_scrub_stamp: 2014-01-08 11:48:47.559227,
  last_clean_scrub_stamp: 2013-12-13 15:38:13.283741,
  log_size: 3001,
  ondisk_log_size: 3001,
  stats_invalid: 0,
  stat_sum: { num_bytes: 180355088,
  num_objects: 44,
  num_object_clones: 0,
  num_object_copies: 44,
  num_objects_missing_on_primary: 0,
  num_objects_degraded: 0,
  num_objects_unfound: 0,
  num_objects_dirty: 0,
  num_whiteouts: 0,
  num_read: 143836,
  num_read_kb: 3135807,
  num_write: 195467,
  num_write_kb: 2994821,
  num_scrub_errors: 1

Re: [ceph-users] repair incosistent pg using emperor

2014-01-06 Thread David Zafman

Did the inconsistent flag eventually get cleared?  It might have been you 
didn’t wait long enough for the repair to get through the pg.
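
One way to tell whether the repair is still grinding through the PG rather 
than being ignored is to watch the PG state and the primary's log while it 
runs (osd.8 per the output below; default log path assumed):

# ceph pg 6.29f query | grep state
# tail -f /var/log/ceph/ceph-osd.8.log | grep 6.29f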

David Zafman
Senior Developer
http://www.inktank.com




On Dec 28, 2013, at 12:29 PM, Corin Langosch corin.lango...@netskin.com wrote:

 Hi Sage,
 
 Am 28.12.2013 19:18, schrieb Sage Weil:
 
  ceph pg scrub 6.29f
 
 ...and see if it comes back with errors or not.  If it doesn't, you
 can
 What do you mean with comnes back with error or not?
 
 ~# ceph pg scrub 6.29f
 instructing pg 6.29f on osd.8 to scrub
 
 But the logs don't show any scrubbing.  In fact the command doesn't see
 to do anything at all
 
 
  ceph pg repair 6.29f
 
 to clear the inconsistent flag.
 
 ~# ceph pg repair 6.29f
 instructing pg 6.29f on osd.8 to repair
 
 Again, nothing in the logs. It seems the commands are completely ignored?
 
 I already restarted osd 8 two times and tried again, no change...
 
 Corin
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HDD bad sector, pg inconsistent, no object remapping

2013-11-18 Thread David Zafman

No, you wouldn’t need to re-replicate the whole disk for a single bad sector.  
The way to deal with that if the object is on the primary is to remove the file 
manually from the OSD’s filesystem and perform a repair of the PG that holds 
that object.  This will copy the object back from one of the replicas.
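
A cautious sketch of that sequence for the example PG from this thread 
(11.15d, with the bad copy on osd.0; the object filename, paths and service 
commands are illustrative):

# ceph osd set noout
# stop ceph-osd id=0             (or: systemctl stop ceph-osd@0)
# find /var/lib/ceph/osd/ceph-0/current/11.15d_head/ -name '*<object-name>*'
  (move or delete the one file that matches the bad object)
# start ceph-osd id=0
# ceph osd unset noout
# ceph pg repair 11.15d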

David

On Nov 17, 2013, at 10:46 PM, Chris Dunlop ch...@onthe.net.au wrote:

 Hi David,
 
 On Fri, Nov 15, 2013 at 10:00:37AM -0800, David Zafman wrote:
 
 Replication does not occur until the OSD is “out.”  This creates a new 
 mapping in the cluster of where the PGs should be and thus data begins to 
 move and/or create sufficient copies.  This scheme lets you control how and 
 when you want the replication to occur.  If you have plenty of space and you 
 aren’t going to replace the drive immediately, just mark the OSD “down AND 
 “out..  If you are going to replace the drive immediately, set the “noout” 
 flag.  Take the OSD “down” and replace drive.  Assuming it is mounted in the 
 same place as the bad drive, bring the OSD back up.  This will replicate 
 exactly the same PGs the bad drive held back to the replacement drive.  As 
 was stated before don’t forget to “ceph osd unset noout
 
 Keep in mind that in the case of a machine that has a hardware failure and 
 takes OSD(s) down there is an automatic timeout which will mark them “out 
 for unattended operation.  Unless you are monitoring the cluster 24/7 you 
 should have enough disk space available to handle failures.
 
 Related info in:
 
 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 Are you saying, if a disk suffers from a bad sector in an object
 for which it's primary, and for which good data exists on other
 replica PGs, there's no way for ceph to recover other than by
 (re-)replicating the whole disk?
 
 I.e., even if the disk is able to remap the bad sector using a
 spare, so the disk is ok (albeit missing a sector's worth of
 object data), the only way to recover is to basically blow away
 all the data on that disk and start again, replicating
 everything back to the disk (or to other disks)?
 
 Cheers,
 
 Chris.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HDD bad sector, pg inconsistent, no object remapping

2013-11-18 Thread David Zafman

I looked at the code.  The automatic repair should handle getting an EIO during 
read of the object replica.  It does NOT require removing the object as I said 
before, so it doesn’t matter which copy has bad sectors.  It will copy from a 
good replica to the primary, if necessary.  By default a deep-scrub which would 
catch this case is performed weekly.  A repair must be initiated by 
administrative action.

When replicas differ due to comparison of checksums, we currently don’t have a 
way to determine which copy(s) are corrupt.  This is where a manual 
intervention may be necessary if the administrator can determine which copy(s) 
are bad.

David Zafman
Senior Developer
http://www.inktank.com




On Nov 18, 2013, at 1:11 PM, Chris Dunlop ch...@onthe.net.au wrote:

 OK, that's good (as far is it goes, being a manual process).
 
 So then, back to what I think was Mihály's original issue:
 
 pg repair or deep-scrub can not fix this issue. But if I
 understand correctly, osd has to known it can not retrieve
 object from osd.0 and need to be replicate an another osd
 because there is no 3 working replicas now.
 
 Given a bad checksum and/or read error tells ceph that an object
 is corrupt, it would seem to be a natural step to then have ceph
 automatically use another good-checksum copy, and even rewrite
 the corrupt object, either in normal operation or under a scub
 or repair.
 
 Is there a reason this isn't done, apart from lack of tuits?
 
 Cheers,
 
 Chris
 
 
 On Mon, Nov 18, 2013 at 11:43:46AM -0800, David Zafman wrote:
 
 No, you wouldn’t need to re-replicate the whole disk for a single bad 
 sector.  The way to deal with that if the object is on the primary is to 
 remove the file manually from the OSD’s filesystem and perform a repair of 
 the PG that holds that object.  This will copy the object back from one of 
 the replicas.
 
 David
 
 On Nov 17, 2013, at 10:46 PM, Chris Dunlop ch...@onthe.net.au wrote:
 
 Hi David,
 
 On Fri, Nov 15, 2013 at 10:00:37AM -0800, David Zafman wrote:
 
 Replication does not occur until the OSD is “out.”  This creates a new 
 mapping in the cluster of where the PGs should be and thus data begins to 
 move and/or create sufficient copies.  This scheme lets you control how 
 and when you want the replication to occur.  If you have plenty of space 
 and you aren’t going to replace the drive immediately, just mark the OSD 
 “down AND “out..  If you are going to replace the drive immediately, set 
 the “noout” flag.  Take the OSD “down” and replace drive.  Assuming it is 
 mounted in the same place as the bad drive, bring the OSD back up.  This 
 will replicate exactly the same PGs the bad drive held back to the 
 replacement drive.  As was stated before don’t forget to “ceph osd unset 
 noout
 
 Keep in mind that in the case of a machine that has a hardware failure and 
 takes OSD(s) down there is an automatic timeout which will mark them “out 
 for unattended operation.  Unless you are monitoring the cluster 24/7 you 
 should have enough disk space available to handle failures.
 
 Related info in:
 
 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 Are you saying, if a disk suffers from a bad sector in an object
 for which it's primary, and for which good data exists on other
 replica PGs, there's no way for ceph to recover other than by
 (re-)replicating the whole disk?
 
 I.e., even if the disk is able to remap the bad sector using a
 spare, so the disk is ok (albeit missing a sector's worth of
 object data), the only way to recover is to basically blow away
 all the data on that disk and start again, replicating
 everything back to the disk (or to other disks)?
 
 Cheers,
 
 Chris.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HDD bad sector, pg inconsistent, no object remapping

2013-11-15 Thread David Zafman

Replication does not occur until the OSD is “out.”  This creates a new mapping 
in the cluster of where the PGs should be, and thus data begins to move and/or 
create sufficient copies.  This scheme lets you control how and when you want 
the replication to occur.  If you have plenty of space and you aren’t going to 
replace the drive immediately, just mark the OSD “down” AND “out.”  If you are 
going to replace the drive immediately, set the “noout” flag, take the OSD 
“down”, and replace the drive.  Assuming it is mounted in the same place as the 
bad drive, bring the OSD back up.  This will replicate exactly the same PGs the 
bad drive held back to the replacement drive.  As was stated before, don’t 
forget to “ceph osd unset noout”.

Keep in mind that in the case of a machine that has a hardware failure and 
takes OSD(s) down, there is an automatic timeout which will mark them “out” for 
unattended operation.  Unless you are monitoring the cluster 24/7 you should 
have enough disk space available to handle failures.
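
A compact sketch of the replace-the-drive path described above (OSD id, 
mount point and service commands are illustrative):

# ceph osd set noout
# stop ceph-osd id=0             (or: systemctl stop ceph-osd@0)
  (physically swap the drive, then recreate and mount the OSD's data directory
   at /var/lib/ceph/osd/ceph-0 for the same OSD id)
# start ceph-osd id=0
# ceph osd unset noout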

Related info in:

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

David Zafman
Senior Developer
http://www.inktank.com

On Nov 15, 2013, at 1:58 AM, Mihály Árva-Tóth 
mihaly.arva-t...@virtual-call-center.eu wrote:

 Hello,
 
 I think you misunderstood me. I known how can I replace bad HDD, thanks. My 
 problem is the following:
 
 Object replica number is 3. Objects that in 11.15d PG which store data on 
 osd.0 bad sectors place inter alia. Ceph should know objects are in 11.15d on 
 osd.0 is bad (because deep-scrub and repair are both failed), so there is no 
 3 clean replica number only two. I think Ceph should replicate its to new 
 osd. Or not?
 
 Thank you,
 Mihaly
 
 2013/11/13 David Zafman david.zaf...@inktank.com
 
 Since the disk is failing and you have 2 other copies I would take osd.0 
 down.  This means that ceph will not attempt to read the bad disk either for 
 clients or to make another copy of the data:
 
 * Not sure about the syntax of this for the version of ceph you are 
 running
 ceph osd down 0
 
 Mark it “out” which will immediately trigger recovery to create more copies 
 of the data with the remaining OSDs.
 ceph osd out 0
 
 You can now finish the process of removing the osd by looking at these 
 instructions:
 
 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 On Nov 12, 2013, at 3:16 AM, Mihály Árva-Tóth 
 mihaly.arva-t...@virtual-call-center.eu wrote:
 
  Hello,
 
  I have 3 node, with 3 OSD in each node. I'm using .rgw.buckets pool with 3 
  replica. One of my HDD (osd.0) has just bad sectors, when I try to read an 
  object from OSD direct, I get Input/output errror. dmesg:
 
  [1214525.670065] mpt2sas0: log_info(0x3108): originator(PL), 
  code(0x08), sub_code(0x)
  [1214525.670072] mpt2sas0: log_info(0x3108): originator(PL), 
  code(0x08), sub_code(0x)
  [1214525.670100] sd 0:0:2:0: [sdc] Unhandled sense code
  [1214525.670104] sd 0:0:2:0: [sdc]
  [1214525.670107] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  [1214525.670110] sd 0:0:2:0: [sdc]
  [1214525.670112] Sense Key : Medium Error [current]
  [1214525.670117] Info fld=0x60c8f21
  [1214525.670120] sd 0:0:2:0: [sdc]
  [1214525.670123] Add. Sense: Unrecovered read error
  [1214525.670126] sd 0:0:2:0: [sdc] CDB:
  [1214525.670128] Read(16): 88 00 00 00 00 00 06 0c 8f 20 00 00 00 08 00 00
 
  Okay I known need to replace HDD.
 
  Fragment of ceph -s  output:
pgmap v922039: 856 pgs: 855 active+clean, 1 active+clean+inconsistent;
 
  ceph pg dump | grep inconsistent
 
  11.15d  25443   0   0   0   6185091790  30013001
  active+clean+inconsistent   2013-11-06 02:30:45.23416.
 
  ceph pg map 11.15d
 
  osdmap e1600 pg 11.15d (11.15d) - up [0,8,3] acting [0,8,3]
 
  pg repair or deep-scrub can not fix this issue. But if I understand 
  correctly, osd has to known it can not retrieve object from osd.0 and need 
  to be replicate an another osd because there is no 3 working replicas now.
 
  Thank you,
  Mihaly
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HDD bad sector, pg inconsistent, no object remapping

2013-11-12 Thread David Zafman

Since the disk is failing and you have 2 other copies I would take osd.0 down.  
This means that ceph will not attempt to read the bad disk either for clients 
or to make another copy of the data:

* Not sure about the syntax of this for the version of ceph you are running
ceph osd down 0

Mark it “out” which will immediately trigger recovery to create more copies of 
the data with the remaining OSDs.
ceph osd out 0

You can now finish the process of removing the osd by looking at these 
instructions:

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
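
For reference, the manual removal in those instructions boils down to roughly 
the following once the OSD is down and out (osd.0 here, matching the thread):

# ceph osd crush remove osd.0
# ceph auth del osd.0
# ceph osd rm 0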

David Zafman
Senior Developer
http://www.inktank.com

On Nov 12, 2013, at 3:16 AM, Mihály Árva-Tóth 
mihaly.arva-t...@virtual-call-center.eu wrote:

 Hello,
 
 I have 3 node, with 3 OSD in each node. I'm using .rgw.buckets pool with 3 
 replica. One of my HDD (osd.0) has just bad sectors, when I try to read an 
 object from OSD direct, I get Input/output errror. dmesg:
 
 [1214525.670065] mpt2sas0: log_info(0x3108): originator(PL), code(0x08), 
 sub_code(0x)
 [1214525.670072] mpt2sas0: log_info(0x3108): originator(PL), code(0x08), 
 sub_code(0x)
 [1214525.670100] sd 0:0:2:0: [sdc] Unhandled sense code
 [1214525.670104] sd 0:0:2:0: [sdc]  
 [1214525.670107] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
 [1214525.670110] sd 0:0:2:0: [sdc]  
 [1214525.670112] Sense Key : Medium Error [current] 
 [1214525.670117] Info fld=0x60c8f21
 [1214525.670120] sd 0:0:2:0: [sdc]  
 [1214525.670123] Add. Sense: Unrecovered read error
 [1214525.670126] sd 0:0:2:0: [sdc] CDB: 
 [1214525.670128] Read(16): 88 00 00 00 00 00 06 0c 8f 20 00 00 00 08 00 00
 
 Okay I known need to replace HDD.
 
 Fragment of ceph -s  output:
   pgmap v922039: 856 pgs: 855 active+clean, 1 active+clean+inconsistent;
 
 ceph pg dump | grep inconsistent
 
 11.15d  25443   0   0   0   6185091790  30013001
 active+clean+inconsistent   2013-11-06 02:30:45.23416.
 
 ceph pg map 11.15d
 
 osdmap e1600 pg 11.15d (11.15d) - up [0,8,3] acting [0,8,3]
 
 pg repair or deep-scrub can not fix this issue. But if I understand 
 correctly, osd has to known it can not retrieve object from osd.0 and need to 
 be replicate an another osd because there is no 3 working replicas now.
 
 Thank you,
 Mihaly
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very unbalanced osd data placement with differing sized devices

2013-10-16 Thread David Zafman

I may be wrong, but I always thought that a weight of 0 means don't put 
anything there.  All weights > 0 will be looked at proportionally.

See http://ceph.com/docs/master/rados/operations/crush-map/ which recommends 
higher weights anyway:

Weighting Bucket Items

Ceph expresses bucket weights as double integers, which allows for fine 
weighting. A weight is the relative difference between device capacities. We 
recommend using 1.00 as the relative weight for a 1TB storage device. In such a 
scenario, a weight of 0.5 would represent approximately 500GB, and a weight of 
3.00 would represent approximately 3TB. Higher level buckets have a weight that 
is the sum total of the leaf items aggregated by the bucket.

A bucket item weight is one dimensional, but you may also calculate your item 
weights to reflect the performance of the storage drive. For example, if you 
have many 1TB drives where some have relatively low data transfer rate and the 
others have a relatively high data transfer rate, you may weight them 
differently, even though they have the same capacity (e.g., a weight of 0.80 
for the first set of drives with lower total throughput, and 1.20 for the 
second set of drives with higher total throughput).
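
In this particular test that would mean explicitly reweighting the two 5 GB 
OSDs to roughly half the weight of the 10 GB ones, e.g. (values illustrative):

# ceph osd crush reweight osd.0 0.005
# ceph osd crush reweight osd.1 0.005
# ceph osd tree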


David Zafman
Senior Developer
http://www.inktank.com




On Oct 16, 2013, at 8:15 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz 
wrote:

 I stumbled across this today:
 
 4 osds on 4 hosts (names ceph1 - ceph4). They are KVM guests (this is a play 
 setup).
 
 - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal)
 - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal)
 
 I do a standard installation via ceph-deploy (1.2.7) of ceph (0.67.4) on each 
 one [1]. The topology looks like:
 
 $ ceph osd tree
 # idweighttype nameup/downreweight
 -10.01999root default
 -20host ceph1
 00osd.0up1
 -30host ceph2
 10osd.1up1
 -40.009995host ceph3
 20.009995osd.2up1
 -50.009995host ceph4
 30.009995osd.3up1
 
 So osd.0 and osd.1 (on ceph1,2) have weight 0, and osd2 and osd.3 (on 
 ceph3,4) have weight 0.009995 this suggests that data will flee osd.0,1 and 
 live only on osd.3.4. Sure enough putting in a few objects via radus put 
 results in:
 
 ceph1 $ df -m
 Filesystem 1M-blocks  Used Available Use% Mounted on
 /dev/vda1   5038  2508  2275  53% /
 udev 994 1   994   1% /dev
 tmpfs401 1   401   1% /run
 none   5 0 5   0% /run/lock
 none1002 0  1002   0% /run/shm
 /dev/vdb1   510940  5070   1% /var/lib/ceph/osd/ceph-0
 
 (similarly for ceph2), whereas:
 
 ceph3 $df -m
 Filesystem 1M-blocks  Used Available Use% Mounted on
 /dev/vda1   5038  2405  2377  51% /
 udev 994 1   994   1% /dev
 tmpfs401 1   401   1% /run
 none   5 0 5   0% /run/lock
 none1002 0  1002   0% /run/shm
 /dev/vdb1  10229  1315  8915  13% /var/lib/ceph/osd/ceph-2
 
 (similarly for ceph4). Obviously I can fix this via the reweighting the first 
 two osds to something like 0.005, but I'm wondering if there is something 
 I've missed - clearly some kind of auto weighting is has been performed on 
 the basis of the size difference in the data volumes, but looks to be skewing 
 data far too much to the bigger ones. Is there perhaps a bug in the smarts 
 for this? Or is it just because I'm using small volumes (5G = 0 weight)?
 
 Cheers
 
 Mark
 
 [1] i.e:
 
 $ ceph-deploy new ceph1
 $ ceph-deploy mon create ceph1
 $ ceph-deploy gatherkeys ceph1
 $ ceph-deploy osd create ceph1:/dev/vdb:/dev/vdc
 ...
 $ ceph-deploy osd create ceph4:/dev/vdb:/dev/vdc
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10/100 network for Mons?

2013-09-19 Thread David Zafman

I believe that the nature of the monitor network traffic should be fine with 
10/100 network ports.

David Zafman
Senior Developer
http://www.inktank.com

On Sep 18, 2013, at 1:24 PM, Gandalf Corvotempesta 
gandalf.corvotempe...@gmail.com wrote:

 Hi to all.
 I'm currently building a test cluster with 3 OSD servers connected with
 IPoIB for the cluster network and 10GbE for the public network.
 
 I have to connect these OSDs to some MON servers located in another
 rack with no gigabit or 10Gb connection.
 
 Could I use some 10/100 network ports? What kind of traffic is
 managed by the mons?
 AFAIK, clients will connect directly to the right OSDs (and there I
 have a 10GbE network), so the mons shouldn't require that speed.
 
 Is this right?
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image

2013-08-19 Thread David Zafman

Transferring this back to ceph-users.  Sorry, I can't help with rbd issues.  
One thing I will say is that if you are mounting an rbd device with a 
filesystem on a machine to export ftp, you can't also export the same device 
via iSCSI.

David Zafman
Senior Developer
http://www.inktank.com

On Aug 19, 2013, at 8:39 PM, PJ linalin1...@gmail.com wrote:

 2013/8/14 David Zafman david.zaf...@inktank.com
 
 On Aug 12, 2013, at 7:41 PM, Josh Durgin josh.dur...@inktank.com wrote:
 
  On 08/12/2013 07:18 PM, PJ wrote:
 
  If the target rbd device only map on one virtual machine, format it as
  ext4 and mount to two places
mount /dev/rbd0 /nfs -- for nfs server usage
mount /dev/rbd0 /ftp  -- for ftp server usage
  nfs and ftp servers run on the same virtual machine. Will file system
  (ext4) help to handle the simultaneous access from nfs and ftp?
  
  I doubt that'll work perfectly on a normal disk, although rbd should
  behave the same in this case. Consider what happens when the same files
  are modified at once by the ftp and nfs servers; there are going to be
  some issues. You could run ftp on an nfs client on a different machine
  safely.
 
 
 
 Modern Linux kernels will do a bind mount when a block device is mounted on 2 
 different directories.   Think directory hard links.  Simultaneous access 
 will NOT corrupt ext4, but as Josh said modifying the same file at once by 
 ftp and nfs isn't going produce good results.  With file locking 2 nfs 
 clients could coordinate using advisory locking.
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 The first issue is reproduced, but there are changes to the system configuration. 
 Due to a hardware shortage, we only have one physical machine; it has one OSD 
 installed and runs 6 virtual machines. There is only one monitor (wistor-003) and 
 one FTP server (wistor-004); the other virtual machines are iSCSI servers.
 
 The log size is big because when enabling the FTP service for an rbd device, we 
 have an rbd map retry loop in case it fails (retry rbd map every 10 sec, for up 
 to 3 minutes). Please download the monitor log from the link below:
 https://www.dropbox.com/s/88cb9q91cjszuug/ceph-mon.wistor-003.log.zip
 
 Here are the operation steps:
 1. The pool rex is created
Around 2013-08-20 09:16:38~09:16:39
 2. The first attempt to map the rbd device on wistor-004 fails (all retries 
 failed)
    Around 2013-08-20 09:17:43~09:20:46 (180 sec)
 3. The second attempt works, but there were still 9 failures in the retry loop
Around 2013-08-20 09:20:48~09:22:10 (82 sec)
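 
 For reference, the retry loop is roughly equivalent to this sketch (the pool and 
 image names here are placeholders, not our real ones):
 
 # try rbd map every 10 seconds, for up to 3 minutes (18 attempts)
 for i in $(seq 1 18); do
     rbd map rex/image0 && break
     sleep 10
 done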
 
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image

2013-08-13 Thread David Zafman

On Aug 12, 2013, at 7:41 PM, Josh Durgin josh.dur...@inktank.com wrote:

 On 08/12/2013 07:18 PM, PJ wrote:
 
 If the target rbd device only map on one virtual machine, format it as
 ext4 and mount to two places
   mount /dev/rbd0 /nfs -- for nfs server usage
   mount /dev/rbd0 /ftp  -- for ftp server usage
 nfs and ftp servers run on the same virtual machine. Will file system
 (ext4) help to handle the simultaneous access from nfs and ftp?
 
 I doubt that'll work perfectly on a normal disk, although rbd should
 behave the same in this case. Consider what happens when the same files
 are modified at once by the ftp and nfs servers; there are going to be
 some issues. You could run ftp on an nfs client on a different machine
 safely.
 


Modern Linux kernels will do a bind mount when a block device is mounted on 2 
different directories.   Think directory hard links.  Simultaneous access will 
NOT corrupt ext4, but as Josh said modifying the same file at once by ftp and 
nfs isn't going to produce good results.  With file locking, 2 nfs clients could 
coordinate using advisory locking.  
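A minimal sketch of that kind of advisory locking with flock(1), assuming both writers run on the same host (filenames here are placeholders):

# /nfs and /ftp are the same ext4 filesystem mounted twice, so a lock taken
# on a lock file through either path protects the same inode
flock /nfs/.upload.lock -c 'cp /tmp/incoming.dat /nfs/incoming.dat'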

David Zafman
Senior Developer
http://www.inktank.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph repair details

2013-06-06 Thread David Zafman

Repair does the equivalent of a deep-scrub to find problems.  This mostly 
involves reading object data/omap/xattrs to create checksums and comparing them 
across all copies.  When a discrepancy is identified, an arbitrary copy which did 
not have I/O errors is selected and used to re-write the other replicas.
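In practice that usually means something like the following (a sketch, with a placeholder pg id):

ceph pg deep-scrub <pgid>     # re-check and report the inconsistencies
ceph pg repair <pgid>         # rewrite the other replicas from a chosen copy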


David Zafman
Senior Developer
http://www.inktank.com




On May 25, 2013, at 12:33 PM, Mike Lowe j.michael.l...@gmail.com wrote:

 Does anybody know exactly what ceph repair does?  Could you list out briefly 
 the steps it takes?  I unfortunately need to use it for an inconsistent pg.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PG's, repair ineffective

2013-05-21 Thread David Zafman

I can't reproduce this on v0.61-2.  Could the disks for osd.13 and osd.22 be 
unwritable?

In your case it looks like the 3rd replica is probably the bad one, since 
osd.13 and osd.22 are the same.  You probably want to manually repair the 3rd 
replica.
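Roughly, that would look like the sketch below (only a sketch: the pg id and object name are taken from your log, the data path and osd number are assumptions, and you should stop the OSD first and keep a copy of anything you move aside):

# on the host holding the bad third replica, with that osd stopped
cd /var/lib/ceph/osd/ceph-NN/current/19.1b_head
# find the file whose name starts with the object name from the scrub errors
# (rb.0.6989.2ae8944a.005b), move it aside, and copy the corresponding file
# over from an osd that holds a good replica; then restart the osd and run:
ceph pg repair 19.1b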

David Zafman
Senior Developer
http://www.inktank.com




On May 21, 2013, at 6:45 AM, John Nielsen li...@jnielsen.net wrote:

 Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.
 
 On May 21, 2013, at 12:13 AM, David Zafman david.zaf...@inktank.com wrote:
 
 
 What version of ceph are you running?
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 On May 20, 2013, at 9:14 AM, John Nielsen li...@jnielsen.net wrote:
 
 Some scrub errors showed up on our cluster last week. We had some issues 
 with host stability a couple weeks ago; my guess is that errors were 
 introduced at that point and a recent background scrub detected them. I was 
 able to clear most of them via ceph pg repair, but several remain. Based 
 on some other posts, I'm guessing that they won't repair because it is the 
 primary copy that has the error. All of our pools are set to size 3 so 
 there _ought_ to be a way to verify and restore the correct data, right?
 
 Below is some log output about one of the problem PG's. Can anyone suggest 
 a way to fix the inconsistencies?
 
 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 
 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 
 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 
 19.1b repair 0 missing, 1 inconsistent objects
 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 
 19.1b repair 2 errors, 2 fixed
 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 
 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 
 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 
 19.1b deep-scrub 0 missing, 1 inconsistent objects
 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 
 19.1b deep-scrub 2 errors
 
 Thanks,
 
 JN
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN after upgrade to cuttlefish

2013-05-08 Thread David Zafman

According to "osdmap e504: 4 osds: 2 up, 2 in" you have 2 of 4 osds that are 
down and out.  That may be the issue.
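To see which ones, and what specifically is behind the warning, something like this should help (output will vary):

ceph health detail    # lists the specific issues behind HEALTH_WARN
ceph osd tree         # shows which osds are down/out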

David Zafman
Senior Developer
http://www.inktank.com

On May 8, 2013, at 12:05 AM, James Harper james.har...@bendigoit.com.au wrote:

 I've just upgraded my ceph install to cuttlefish (was 0.60) from Debian.
 
 My mons don't regularly die anymore, or at least haven't so far, but health 
 is always HEALTH_WARN even though I can't see any indication as to why:
 
 # ceph status
   health HEALTH_WARN
   monmap e1: 3 mons at 
 {4=192.168.200.197:6789/0,7=192.168.200.190:6789/0,8=192.168.200.191:6789/0}, 
 election epoch 1104, quorum 0,1,2 4,7,8
   osdmap e504: 4 osds: 2 up, 2 in
pgmap v210142: 832 pgs: 832 active+clean; 318 GB data, 638 GB used, 1223 
 GB / 1862 GB avail; 4970B/s rd, 7456B/s wr, 2op/s
   mdsmap e577: 1/1/1 up {0=7=up:active}
 
 Anyone have any idea what might be wrong, or where I can look to find out 
 more?
 
 Thanks
 
 James
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help: Ceph upgrade.

2013-04-25 Thread David Zafman

I don't believe that there would be a perceptible increase in data usage.  The 
next release, called Cuttlefish, is less than a week away, so you might wait 
for that.

Product questions should go to one of our mailing lists, not directly to 
developers.

David Zafman
Senior Developer
http://www.inktank.com

On Apr 24, 2013, at 11:35 PM, MinhTien MinhTien tientienminh080...@gmail.com 
wrote:

 Hi David Zafman
 
 
 I use ceph 0.56.4. I want to upgrade to version 0.60. 
 
 I have a lot of data in ceph storage. 
 
 If I upgrade, is my data usage affected?
 
 
 Thanks and Regard
 
 -- 
 Bui Minh Tien

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph error: active+clean+scrubbing+deep

2013-04-24 Thread David Zafman

I'm not sure what the point is of running with replication set to 1, but a new 
feature adds ceph commands to turn off scrubbing:

Check ceph --help to see if you have a version that has this.

  ceph osd set noout|noin|nodown|noup|noscrub|nodeep-scrub
  ceph osd unset noout|noin|nodown|noup|noscrub|nodeep-scrub

You might want to turn off both kinds of scrubbing.

ceph osd set noscrub
ceph osd set nodeep-scrub


David Zafman
Senior Developer
http://www.inktank.com

On Apr 16, 2013, at 12:30 AM, kakito tientienminh080...@gmail.com wrote:

 Hi Martin B Nielsen,
 
 Thank you for your quick answer :)
 
 I am running with replication set to 1 because my servers use RAID 6, 
 divided into 4 partitions; each partition is 1 OSD, formatted ext4. I have 2 
 servers == 8 OSDs.
 
 Do you have any advice? ^^
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Read Benchmark

2013-03-12 Thread David Zafman

First, try doing something like this:

rados bench -p data 300 write --no-cleanup

David Zafman
Senior Developer
http://www.inktank.com




On Mar 12, 2013, at 1:46 PM, Scott Kinder skin...@yieldex.com wrote:

 When I try and do a rados bench, I see the following error:
 
 # rados bench -p data 300 seq
 Must write data before running a read benchmark!
 error during benchmark: -2
 error 2: (2) No such file or directory
 
 There's been objects written to the data pool. What's required to get the 
 read bench test to work?
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Read Benchmark

2013-03-12 Thread David Zafman

I would either make my own pool and delete it when done:

rados mkpool testpool
RUN BENCHMARKS
rados rmpool testpool testpool --yes-i-really-really-mean-it

or use the cleanup command, but I ended up having to also delete 
benchmark_last_metadata

RUN BENCHMARKS
rados -p data ls
…
# Note the names of the preceding _object#

rados -p data cleanup benchmark_data_ubuntu_#
rados -p data rm benchmark_last_metadata

David Zafman
Senior Developer
http://www.inktank.com




On Mar 12, 2013, at 2:11 PM, Scott Kinder skin...@yieldex.com wrote:

 A follow-up question: how do I clean up the written data after I finish 
 my benchmarks? I notice there is a cleanup prefix object command, 
 though I'm unclear on how to use it.
 
 
 On Tue, Mar 12, 2013 at 2:59 PM, Scott Kinder skin...@yieldex.com wrote:
 That did the trick, thanks David.
 
 
 On Tue, Mar 12, 2013 at 2:48 PM, David Zafman david.zaf...@inktank.com 
 wrote:
 
  First, try doing something like this:
 
 rados bench -p data 300 write --no-cleanup
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 
 
 On Mar 12, 2013, at 1:46 PM, Scott Kinder skin...@yieldex.com wrote:
 
 When I try and do a rados bench, I see the following error:
 
 # rados bench -p data 300 seq
 Must write data before running a read benchmark!
 error during benchmark: -2
 error 2: (2) No such file or directory
 
 There's been objects written to the data pool. What's required to get the 
 read bench test to work?
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] deep scrub

2013-02-27 Thread David Zafman

deep-scrub finds problems; it doesn't fix them.  Try:

ceph osd repair osd-id
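If you only want to repair the single pg reported below, the per-pg form mentioned elsewhere on this list should also work (pg id taken from your report):

ceph pg repair 9.3c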

David Zafman
Senior Developer
david.zaf...@inktank.com



On Feb 27, 2013, at 12:43 AM, Jun Jun8 Liu liuj...@lenovo.com wrote:

 Hi all
 I did a test of deep scrub. The version is ceph version 0.56.2 
 (586538e22afba85c59beda49789ec42024e7a061)
 The steps are:
  1  # ceph pg map 9.3c
     osdmap e1279 pg 9.3c (9.3c) -> up [2,9,12] acting [2,9,12]
  2  remove all files under dir /data12/current/9.3c_head/ on osd 12
  3  ceph pg deep-scrub 9.3c
  4  # ceph health details
     HEALTH_ERR 1 pgs inconsistent; 36 scrub errors
     pg 9.3c is active+clean+inconsistent, acting [2,9,12]
     36 scrub errors
 I waited about half an hour; ceph status is still HEALTH_ERR.
 
 Is there any method to get ceph back to a HEALTH_OK state?
  
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com