[ceph-users] ceph plugin balancer error

2018-07-05 Thread Chris Hsiang
Hi,

I am running test on ceph mimic  13.0.2.1874+ge31585919b-lp150.1.2 using
openSUSE-Leap-15.0

when I ran "ceph balancer status", it errors out.

g1:/var/log/ceph # ceph balancer status
Error EIO: Module 'balancer' has experienced an error and cannot handle
commands: 'dict' object has no attribute 'iteritems'

What configuration is needed to get it working?

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph plugin balancer error

2018-07-05 Thread Chris Hsiang
The weird thing is that:

g1:~ # locate /bin/python
/usr/bin/python
/usr/bin/python2
/usr/bin/python2.7
/usr/bin/python3
/usr/bin/python3.6
/usr/bin/python3.6m

g1:~ # ls /usr/bin/python* -al
lrwxrwxrwx 1 root root 9 May 13 07:41 /usr/bin/python -> python2.7
lrwxrwxrwx 1 root root 9 May 13 07:41 /usr/bin/python2 -> python2.7
-rwxr-xr-x 1 root root  6304 May 13 07:41 /usr/bin/python2.7
lrwxrwxrwx 1 root root 9 May 13 08:39 /usr/bin/python3 -> python3.6
-rwxr-xr-x 2 root root 10456 May 13 08:39 /usr/bin/python3.6
-rwxr-xr-x 2 root root 10456 May 13 08:39 /usr/bin/python3.6m

My default Python environment is 2.7... so the dict object should have the
iteritems() method.

Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph plugin balancer error

2018-07-05 Thread Chris Hsiang
I have tried to modify /usr/lib64/ceph/mgr/balancer/module.py, replacing iteritems() with items(), but I still get the following error:

g1:/usr/lib64/ceph/mgr/balancer # ceph balancer status
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/balancer/module.py", line 297, in handle_command
return (0, json.dumps(s, indent=4), '')
  File "/usr/lib64/python3.6/json/__init__.py", line 238, in dumps
**kw).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 201, in encode
chunks = list(chunks)
  File "/usr/lib64/python3.6/json/encoder.py", line 430, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib64/python3.6/json/encoder.py", line 404, in _iterencode_dict
yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 437, in _iterencode
o = _default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
o.__class__.__name__)
TypeError: Object of type 'dict_keys' is not JSON serializable

It seems to me that ceph-mgr is compiled/built against Python 3.6, but the
balancer plugin is written for Python 2.7... this might be related:

https://github.com/ceph/ceph/pull/21446

This might be an issue with how openSUSE builds the Ceph package.
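
For reference, here is a minimal hand-written illustration (not taken from the
module source) of the two Python 2/3 differences behind both errors above --
iteritems() disappearing and dict views no longer being JSON-serializable:

import json

d = {'pool': 1}

# Python 2: d.iteritems() exists and d.keys() returns a plain list.
# Python 3: d.iteritems() raises AttributeError, and d.keys() returns a
# dict_keys view that json.dumps() refuses to serialize.
try:
    items = list(d.iteritems())    # works on 2.7 only
except AttributeError:
    items = list(d.items())        # portable spelling

print(json.dumps(list(d.keys())))  # fine on both 2 and 3
# print(json.dumps(d.keys()))      # TypeError on 3.x: 'dict_keys' is not JSON serializable

So simply swapping iteritems() for items() is not enough; any place that hands
a dict view straight to json.dumps() also needs a list() around it.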

chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-05 Thread Dennis Kramer (DT)

Hi,

I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
How can this be fixed?

logs:
2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x1991921 [2,head] v160 at , but inode 0x1991921.head v146 already exists at




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS server stuck in "resolve" state

2018-07-05 Thread Dennis Kramer (DBS)
Hi,


On Thu, 2018-07-05 at 09:55 +0800, Yan, Zheng wrote:
> On Wed, Jul 4, 2018 at 7:02 PM Dennis Kramer (DBS) wrote:
> > 
> > 
> > Hi,
> > 
> > I have managed to get cephfs mds online again...for a while.
> > 
> > These topics covers more or less my symptoms and helped me get it
> > up
> > and running again:
> > - https://www.spinics.net/lists/ceph-users/msg45696.html
> > - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023133.html
> > 
> > After some time it all goes down again and keeps in a loop trying
> > to
> > get into an "active" state then after a while it crashes again.
> > Logs from the MDS right before it crashes:
> > > 0> 2018-07-04 11:34:54.657595 7f50f1c29700 -1 /build/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f50f1c29700 time 2018-07-04 11:34:54.638462
> > > /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
> > Cluster logging:
> > 2018-07-04 12:50:04.741625 mds.mds01 [ERR] dir 0x198a246 object missing on disk; some files may be lost (
> > 2018-07-04 12:50:16.352824 mon.mon01 [ERR] MDS health message (mds.0): Metadata damage detected
> > 2018-07-04 12:50:16.480045 mon.mon01 [ERR] Health check failed: 1 MDSs report damaged metadata (MDS_DAMAGE)
> > 2018-07-04 12:53:36.194056 mds.mds01 [ERR] loaded dup inode 0x1989e52 [2,head] v1104251 at , but inode 0x1989e52.head v37 already exists at
> > 
> > CephFS won't stay up for long, after some time it crashes and I
> > need to
> > reset the fs to get it back again.
> > 
> > I'm at a loss here.
> I guess you did reset the mds journal. Have you run the complete recovery sequence?
> 
> cephfs-data-scan init
> cephfs-data-scan scan_extents 
> cephfs-data-scan scan_inodes 
> cephfs-data-scan scan_links
> 
> see http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/


cephfs-data-scan init
Inode 0x0x1 already exists, skipping create.  Use --force-init to overwrite the existing object.
Inode 0x0x100 already exists, skipping create.  Use --force-init to overwrite the existing object.

Is it safe to run --force-init or will it break my FS?
Furthermore, do I need to bring my MDS offline while doing the recovery process of:

cephfs-data-scan init
cephfs-data-scan scan_extents  
> > 
> > 
> > On Wed, 2018-06-27 at 21:38 +0800, Yan, Zheng wrote:
> > > 
> > > On Wed, Jun 27, 2018 at 6:16 PM Dennis Kramer (DT) wrote:
> > > > 
> > > > 
> > > > 
> > > > Hi,
> > > > 
> > > > Currently i'm running Ceph Luminous 12.2.5.
> > > > 
> > > > This morning I tried running Multi MDS with:
> > > > ceph fs set  max_mds 2
> > > > 
> > > > I have 5 MDS servers. After running above command,
> > > > I had 2 active MDSs, 2 standby-active and 1 standby.
> > > > 
> > > > And after trying a failover on one
> > > > of the active MDSs, a standby-active did a replay but crashed
> > > > (laggy or
> > > > crashed). Memory and CPU went sky high on the MDS and was
> > > > unresponsive
> > > > after some time. I ended up with the one active MDS but got
> > > > stuck
> > > > with a
> > > > degraded filesystem and warning messages about MDS behind on
> > > > trimming.
> > > > 
> > > > I never got any additional MDS active since then. I tried
> > > > restarting the
> > > > last active MDS (because the filesystem was becoming
> > > > unresponsive
> > > > and had
> > > > a load of slow requests) and it never got past replay -> resolve.
> > > > My MDS
> > > > cluster still isn't active... :(
> > > What is the 'ceph -w' output? If you have enabled multi-active MDS,
> > > all mds ranks need to enter the 'resolve' state before they can
> > > continue to recover.
> > > 
> > > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > What is the "resolve" state? I have never seen that before pre-
> > > > Luminous.
> > > > Debug on 20 doesn't give me much.
> > > > 
> > > > Also tried removing the Multi MDS setup, but my CephFS cluster
> > > > won't go
> > > > active. How can I get my CephFS up and running again in an
> > > > active
> > > > state.
> > > > 
> > > > Please help.
> > > > 
> > > > 
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph plugin balancer error

2018-07-05 Thread Nathan Cutler

Hi Chris:

I suggest you raise your openSUSE Ceph-related questions on the openSUSE 
Ceph mailing list instead of ceph-users. For info on how to join, go to


https://en.opensuse.org/openSUSE:Ceph#Communication

The version of Ceph currently shipping in Leap 15.0 is built against 
Python 3 and this, as you found, exposes python2-specific code in the 
Ceph codebase.


We might reconsider this and push a Python 2 build to Leap 15.0 - let's 
discuss it on opensuse-ceph.


Thanks,
Nathan

On 07/05/2018 09:12 AM, Chris Hsiang wrote:

Hi,

I am running test on ceph mimic  13.0.2.1874+ge31585919b-lp150.1.2 using 
openSUSE-Leap-15.0


when I ran "ceph balancer status", it errors out.

g1:/var/log/ceph # ceph balancer status
Error EIO: Module 'balancer' has experienced an error and cannot handle 
commands: 'dict' object has no attribute 'iteritems'


What configuration is needed to get it working?

Chris


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph plugin balancer error

2018-07-05 Thread Nathan Cutler

Update: opened http://tracker.ceph.com/issues/24779 to track this bug,
and am in the process of fixing it.

The fix will make its way into a future mimic point release.

Thanks, Chris, for bringing the issue to my attention!

Nathan

On 07/05/2018 11:27 AM, Nathan Cutler wrote:

Hi Chris:

I suggest you raise your openSUSE Ceph-related questions on the openSUSE 
Ceph mailing list instead of ceph-users. For info on how to join, go to


https://en.opensuse.org/openSUSE:Ceph#Communication

The version of Ceph currently shipping in Leap 15.0 is built against 
Python 3 and this, as you found, exposes python2-specific code in the 
Ceph codebase.


We might reconsider this and push a Python 2 build to Leap 15.0 - let's 
discuss it on opensuse-ceph.


Thanks,
Nathan

On 07/05/2018 09:12 AM, Chris Hsiang wrote:

Hi,

I am running test on ceph mimic  13.0.2.1874+ge31585919b-lp150.1.2 
using openSUSE-Leap-15.0


when I ran "ceph balancer status", it errors out.

g1:/var/log/ceph # ceph balancer status
Error EIO: Module 'balancer' has experienced an error and cannot 
handle commands: 'dict' object has no attribute 'iteritems'


What configuration is needed to get it working?

Chris


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-05 Thread Dennis Kramer (DBS)
Hi list,

I have a serious problem now... I think.

One of my users just informed me that a file he created (.doc file) has
a different content than before. It looks like the file's inode is
completely wrong and points to the wrong object. I myself have found
another file with the same symptoms. I'm afraid my (production) FS is
corrupt now, unless there is a possibility to fix the inodes.

Timeline of what happend:

Last week I upgraded our Ceph Jewel to Luminous. 
This went without any problem.

I already had 5 MDS available and went with the Multi-MDS feature and
enabled it. The seemed to work okay, but after a while my MDS went
beserk and went flapping (crashed -> replay -> rejoin -> crashed)

The only way to fix this and get the FS back online was the disaster
recovery procedure:

cephfs-journal-tool event recover_dentries summary
ceph fs set cephfs cluster_down true
cephfs-table-tool all reset session
cephfs-table-tool all reset inode
cephfs-journal-tool --rank=cephfs:0 journal reset
ceph mds fail 0
ceph fs reset cephfs --yes-i-really-mean-it

Restarted the MDS and I was back online. Shortly after I was getting a
lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
looks like it had trouble creating new inodes. Right before the crash
it mostly complained something like:

    -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
handle_client_request client_request(client.324932014:1434 create
#0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
caller_gid=0{}) v2
-1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
_submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1
dirs], 1 open files
 0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
12.2.5/src/mds/MDCache.cc: In function 'void
MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
05:05:01.615123
/build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)

I also tried to counter the create inode crash by doing the following:

cephfs-journal-tool event recover_dentries 
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
cephfs-table-tool all reset inode
cephfs-table-tool all take_inos 10

I'm worried that my FS is corrupt because files are not linked
correctly and have different content than they should.

Please help.

On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
> Hi,
> 
> I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
> How can this be fixed?
> 
> logs:
> 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x1991921 
> [2,head] v160 at , but inode 0x1991921.head v146 already 
> exists at 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] corrupt OSD: BlueFS.cc: 828: FAILED assert

2018-07-05 Thread Jake Grimmett
Dear All,

I have a Mimic (13.2.0) cluster, which, due to a bad disk controller,
corrupted three Bluestore OSD's on one node.

Unfortunately these three OSD's crash when they try to start.

systemctl start ceph-osd@193
(snip)
/BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())

Full log here: http://p.ip.fi/yFYn

"ceph-bluestore-tool repair" also crashes, with a similar error in BlueFS.cc

# ceph-bluestore-tool repair --dev /dev/sdc2 --path
/var/lib/ceph/osd/ceph-193
(snip)
/BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())

Full log here: http://p.ip.fi/l_Q_

This command works OK:

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-193
inferring bluefs devices from bluestore path
{
"/var/lib/ceph/osd/ceph-193/block": {
"osd_uuid": "90b25336-9932-4e0b-a16b-51159568c398",
"size": 8001457295360,
"btime": "2017-12-08 15:46:40.034495",
"description": "main",
"bluefs": "1",
"ceph_fsid": "f035ee98-abfd-4496-b903-a403b29c828f",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"ready": "ready",
"whoami": "193"
}
}

# lsblk | grep sdc
sdc   8:32   0   7.3T  0 disk
├─sdc1    8:33   0   100M  0 part  /var/lib/ceph/osd/ceph-193
└─sdc2    8:34   0   7.3T  0 part

Since the OSD's failed, the Cluster has rebalanced, though I still have
ceph HEALTH_ERR:
95 scrub errors; Possible data damage: 11 pgs inconsistent

Manual scrubs are not started by the OSD daemons (reported elsewhere, see
 [ceph-users] "ceph pg scrub" does not start)

Looking at the old logs, I see ~3500 entries in the logs of the bad
OSDs, all similar to:

-9> 2018-07-04 14:42:34.744 7f9ef0bbb1c0  2 rocksdb:
[/root/ceph-build/ceph-13.2.0/src/rocksdb/db/version_set.cc:1330] Unable
to load table properties for file 43530 --- Corruption: bad block
contents���5b

There are a much smaller number of crc errors, similar to :

2> 2018-07-02 12:58:07.702 7fd3649eb1c0 -1
bluestore(/var/lib/ceph/osd/ceph-425) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0xff625379, expected 0x75b558bc, device
location [0xf5a66e~1000], logical extent 0x0~1000, object
#-1:2c691ffb:::osdmap.176500:0#

I'm inclined to wipe these three OSD's and start again, but am happy to
try suggestions to repair.

thanks for any suggestions,

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] jemalloc / Bluestore

2018-07-05 Thread Uwe Sauter
Hi all,

is using jemalloc still recommended for Ceph?

There are multiple sites (e.g. 
https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/) from 
2015 where jemalloc
is praised for higher performance but I found a bug report that Bluestore 
crashes when used with jemalloc.

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupt OSD: BlueFS.cc: 828: FAILED assert

2018-07-05 Thread Igor Fedotov

Hi Jake,

IMO it doesn't make sense to recover from this drive/data as the damage 
coverage looks pretty wide.


By modifying BlueFS code you can bypass that specific assertion but most 
probably BlueFS and  other BlueStore stuff are pretty inconsistent and 
most probably are unrecoverable at the moment. Given that you have valid 
replicated data it's much simpler just to start these OSDs over.



Thanks,

Igor


On 7/5/2018 3:58 PM, Jake Grimmett wrote:

Dear All,

I have a Mimic (13.2.0) cluster, which, due to a bad disk controller,
corrupted three Bluestore OSD's on one node.

Unfortunately these three OSD's crash when they try to start.

systemctl start ceph-osd@193
(snip)
/BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())

Full log here: http://p.ip.fi/yFYn

"ceph-bluestore-tool repair" also crashes, with a similar error in BlueFS.cc

# ceph-bluestore-tool repair --dev /dev/sdc2 --path
/var/lib/ceph/osd/ceph-193
(snip)
/BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())

Full log here: http://p.ip.fi/l_Q_

This command works OK:

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-193
inferring bluefs devices from bluestore path
{
 "/var/lib/ceph/osd/ceph-193/block": {
 "osd_uuid": "90b25336-9932-4e0b-a16b-51159568c398",
 "size": 8001457295360,
 "btime": "2017-12-08 15:46:40.034495",
 "description": "main",
 "bluefs": "1",
 "ceph_fsid": "f035ee98-abfd-4496-b903-a403b29c828f",
 "kv_backend": "rocksdb",
 "magic": "ceph osd volume v026",
 "mkfs_done": "yes",
 "ready": "ready",
 "whoami": "193"
 }
}

# lsblk | grep sdc
sdc   8:32   0   7.3T  0 disk
├─sdc1    8:33   0   100M  0 part  /var/lib/ceph/osd/ceph-193
└─sdc2    8:34   0   7.3T  0 part

Since the OSD's failed, the Cluster has rebalanced, though I still have
ceph HEALTH_ERR:
95 scrub errors; Possible data damage: 11 pgs inconsistent

Manual scrubs are not started by the OSD daemons (reported elsewhere, see
  [ceph-users] "ceph pg scrub" does not start)

Looking at the old logs, I see ~3500 entries in the logs of the bad
OSDs, all similar to:

 -9> 2018-07-04 14:42:34.744 7f9ef0bbb1c0  2 rocksdb:
[/root/ceph-build/ceph-13.2.0/src/rocksdb/db/version_set.cc:1330] Unable
to load table properties for file 43530 --- Corruption: bad block
contents���5b

There are a much smaller number of crc errors, similar to :

2> 2018-07-02 12:58:07.702 7fd3649eb1c0 -1
bluestore(/var/lib/ceph/osd/ceph-425) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0xff625379, expected 0x75b558bc, device
location [0xf5a66e~1000], logical extent 0x0~1000, object
#-1:2c691ffb:::osdmap.176500:0#

I'm inclined to wipe these three OSD's and start again, but am happy to
try suggestions to repair.

thanks for any suggestions,

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jemalloc / Bluestore

2018-07-05 Thread Igor Fedotov

Hi Uwe,

AFAIK jemalloc isn't recommended for use with BlueStore anymore.

tcmalloc is the right way so far.


Thanks,

Igor


On 7/5/2018 4:08 PM, Uwe Sauter wrote:

Hi all,

is using jemalloc still recommended for Ceph?

There are multiple sites (e.g. 
https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/) from 
2015 where jemalloc
is praised for higher performance but I found a bug report that Bluestore 
crashes when used with jemalloc.

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW User Stats Mismatch

2018-07-05 Thread Ryan Leimenstoll
Hi all, 

We currently have a Ceph Luminous 12.2.5 cluster running, among other 
functions, an RGW service for users in our organization. The cluster has been 
upgraded through a few major versions, at least farther back than Hammer. For 
some time, we were bitten by the RGW user quota sync issue that we believe was 
fixed in [0]. With that, we have a number of users whose reported 'total_bytes' 
are higher than their actual data holdings in the cluster. We are taking 
account of objects that might have been written into other buckets when we 
calculate this and running with radosgw-admin user stats —uid=USER —sync-stats. 

While we can no longer replicate the issue since that patch, is there a 
suggested path forward to rectify the existing user stats that may have been 
skewed by this bug before the patched release?

[0] http://tracker.ceph.com/issues/14507

Thanks!
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - How to handle "loaded dup inode" errors

2018-07-05 Thread John Spray
On Thu, Jul 5, 2018 at 1:42 PM Dennis Kramer (DBS)  wrote:
>
> Hi list,
>
> I have a serious problem now... I think.
>
> One of my users just informed me that a file he created (.doc file) has
> a different content than before. It looks like the file's inode is
> completely wrong and points to the wrong object. I myself have found
> another file with the same symptoms. I'm afraid my (production) FS is
> corrupt now, unless there is a possibility to fix the inodes.

You can probably get back to a state with some valid metadata, but it
might not necessarily be the metadata the user was expecting (e.g. if
two files are claiming the same inode number, one of them is
probably going to get deleted).

> Timeline of what happend:
>
> Last week I upgraded our Ceph Jewel to Luminous.
> This went without any problem.
>
> I already had 5 MDS available and went with the Multi-MDS feature and
> enabled it. They seemed to work okay, but after a while my MDS went
> berserk and went flapping (crashed -> replay -> rejoin -> crashed)
>
> The only way to fix this and get the FS back online was the disaster
> recovery procedure:
>
> cephfs-journal-tool event recover_dentries summary
> ceph fs set cephfs cluster_down true
> cephfs-table-tool all reset session
> cephfs-table-tool all reset inode
> cephfs-journal-tool --rank=cephfs:0 journal reset
> ceph mds fail 0
> ceph fs reset cephfs --yes-i-really-mean-it

My concern with this procedure is that the recover_dentries and
journal reset only happened on rank 0, whereas the other 4 MDS ranks
would have retained lots of content in their journals.  I wonder if we
should be adding some more multi-mds aware checks to these tools, to
warn the user when they're only acting on particular ranks (a
reasonable person might assume that recover_dentries with no args is
operating on all ranks, not just 0).  Created
http://tracker.ceph.com/issues/24780 to track improving the default
behaviour.

> Restarted the MDS and I was back online. Shortly after I was getting a
> lot of "loaded dup inode". In the meanwhile the MDS kept crashing. It
> looks like it had trouble creating new inodes. Right before the crash
> it mostly complained something like:
>
> -2> 2018-07-05 05:05:01.614290 7f8f8574b700  4 mds.0.server
> handle_client_request client_request(client.324932014:1434 create
> #0x1360346/pyfiles.txt 2018-07-05 05:05:01.607458 caller_uid=0,
> caller_gid=0{}) v2
> -1> 2018-07-05 05:05:01.614320 7f8f7e73d700  5 mds.0.log
> _submit_thread 24100753876035~1070 : EOpen [metablob 0x1360346, 1
> dirs], 1 open files
>  0> 2018-07-05 05:05:01.661155 7f8f8574b700 -1 /build/ceph-
> 12.2.5/src/mds/MDCache.cc: In function 'void
> MDCache::add_inode(CInode*)' thread 7f8f8574b700 time 2018-07-05
> 05:05:01.615123
> /build/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p)
>
> I also tried to counter the create inode crash by doing the following:
>
> cephfs-journal-tool event recover_dentries
> cephfs-journal-tool journal reset
> cephfs-table-tool all reset session
> cephfs-table-tool all reset inode
> cephfs-table-tool all take_inos 10

This procedure is recovering some metadata from the journal into the
main tree, then resetting everything, but duplicate inodes are
happening when the main tree has multiple dentries containing inodes
using the same inode number.

What you need is something that scans through all the metadata,
notices which entries point to a duplicate, and snips out those
dentries.  I'm not quite up to date on the latest CephFS forward scrub
bits, so hopefully someone else can chime in to comment on whether we
have the tooling for this already.
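
Purely as a conceptual sketch (this is not an existing CephFS tool, and the
entries below are made up), the check such a scan would have to perform is
essentially "which inode numbers are referenced by more than one dentry":

from collections import defaultdict

# hypothetical dump of (directory, name, inode number) dentries
dentries = [
    ("/dir_a", "report.doc", 0x1991921),
    ("/dir_b", "other.doc",  0x1991921),   # same ino -> "loaded dup inode"
    ("/dir_a", "notes.txt",  0x1991922),
]

by_ino = defaultdict(list)
for path, name, ino in dentries:
    by_ino[ino].append((path, name))

for ino, refs in by_ino.items():
    if len(refs) > 1:
        print("dup inode 0x%x referenced by %s" % (ino, refs))

The hard part is not the check itself but walking all of the on-disk metadata
safely and deciding which of the duplicate dentries to snip out.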

John

>
> I'm worried that my FS is corrupt because files are not linked
> correctly and have different content than they should.
>
> Please help.
>
> On Thu, 2018-07-05 at 10:35 +0200, Dennis Kramer (DT) wrote:
> > Hi,
> >
> > I'm getting a bunch of "loaded dup inode" errors in the MDS logs.
> > How can this be fixed?
> >
> > logs:
> > 2018-07-05 10:20:05.591948 mds.mds05 [ERR] loaded dup inode 0x1991921
> > [2,head] v160 at , but inode 0x1991921.head v146 already
> > exists at 
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupt OSD: BlueFS.cc: 828: FAILED assert

2018-07-05 Thread Jake Grimmett
Hi Igor,

Many thanks for the quick reply.

Your advice concurs with my own thoughts: given the damage, it's probably
safest to wipe the OSD's and start over.

thanks again,

Jake


On 05/07/18 14:28, Igor Fedotov wrote:
> Hi Jake,
> 
> IMO it doesn't make sense to recover from this drive/data as the damage
> coverage looks pretty wide.
> 
> By modifying BlueFS code you can bypass that specific assertion but most
> probably BlueFS and  other BlueStore stuff are pretty inconsistent and
> most probably are unrecoverable at the moment. Given that you have valid
> replicated data it's much simpler just to start these OSDs over.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 7/5/2018 3:58 PM, Jake Grimmett wrote:
>> Dear All,
>>
>> I have a Mimic (13.2.0) cluster, which, due to a bad disk controller,
>> corrupted three Bluestore OSD's on one node.
>>
>> Unfortunately these three OSD's crash when they try to start.
>>
>> systemctl start ceph-osd@193
>> (snip)
>> /BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())
>>
>> Full log here: http://p.ip.fi/yFYn
>>
>> "ceph-bluestore-tool repair" also crashes, with a similar error in
>> BlueFS.cc
>>
>> # ceph-bluestore-tool repair --dev /dev/sdc2 --path
>> /var/lib/ceph/osd/ceph-193
>> (snip)
>> /BlueFS.cc: 828: FAILED assert(r != q->second->file_map.end())
>>
>> Full log here: http://p.ip.fi/l_Q_
>>
>> This command works OK:
>>
>> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-193
>> inferring bluefs devices from bluestore path
>> {
>>  "/var/lib/ceph/osd/ceph-193/block": {
>>  "osd_uuid": "90b25336-9932-4e0b-a16b-51159568c398",
>>  "size": 8001457295360,
>>  "btime": "2017-12-08 15:46:40.034495",
>>  "description": "main",
>>  "bluefs": "1",
>>  "ceph_fsid": "f035ee98-abfd-4496-b903-a403b29c828f",
>>  "kv_backend": "rocksdb",
>>  "magic": "ceph osd volume v026",
>>  "mkfs_done": "yes",
>>  "ready": "ready",
>>  "whoami": "193"
>>  }
>> }
>>
>> # lsblk | grep sdc
>> sdc   8:32   0   7.3T  0 disk
>> ├─sdc1    8:33   0   100M  0 part  /var/lib/ceph/osd/ceph-193
>> └─sdc2    8:34   0   7.3T  0 part
>>
>> Since the OSD's failed, the Cluster has rebalanced, though I still have
>> ceph HEALTH_ERR:
>> 95 scrub errors; Possible data damage: 11 pgs inconsistent
>>
>> Manual scrubs are not started by the OSD daemons (reported elsewhere, see
>>   [ceph-users] "ceph pg scrub" does not start)
>>
>> Looking at the old logs, I see ~3500 entries in the logs of the bad
>> OSDs, all similar to:
>>
>>  -9> 2018-07-04 14:42:34.744 7f9ef0bbb1c0  2 rocksdb:
>> [/root/ceph-build/ceph-13.2.0/src/rocksdb/db/version_set.cc:1330] Unable
>> to load table properties for file 43530 --- Corruption: bad block
>> contents���5b
>>
>> There are a much smaller number of crc errors, similar to :
>>
>> 2> 2018-07-02 12:58:07.702 7fd3649eb1c0 -1
>> bluestore(/var/lib/ceph/osd/ceph-425) _verify_csum bad crc32c/0x1000
>> checksum at blob offset 0x0, got 0xff625379, expected 0x75b558bc, device
>> location [0xf5a66e~1000], logical extent 0x0~1000, object
>> #-1:2c691ffb:::osdmap.176500:0#
>>
>> I'm inclined to wipe these three OSD's and start again, but am happy to
>> try suggestions to repair.
>>
>> thanks for any suggestions,
>>
>> Jake
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance tuning for SAN SSD config

2018-07-05 Thread Matthew Stroud
Bump. I’m hoping I can get people more knowledgeable than me to take a look.

From: ceph-users  on behalf of Matthew 
Stroud 
Date: Friday, June 29, 2018 at 10:31 AM
To: ceph-users 
Subject: [ceph-users] Performance tuning for SAN SSD config

We back some of our ceph clusters with SAN SSD disk, particularly VSP G/F and 
Purestorage. I’m curious what are some settings we should look into modifying 
to take advantage of our SAN arrays. We had to manually set the class for the 
luns to SSD class which was a big improvement. However we still see situations 
where we get slow requests and the underlying disks and network are 
underutilized.

More info about our setup. We are running centos 7 with Luminous as our ceph 
release. We have 4 osd nodes that have 5x2TB disks each and they are setup as 
bluestore. Our ceph.conf is attached with some information removed for security 
reasons.

Thanks ahead of time.

Thanks,
Matthew Stroud




CONFIDENTIALITY NOTICE: This message is intended only for the use and review of 
the individual or entity to which it is addressed and may contain information 
that is privileged and confidential. If the reader of this message is not the 
intended recipient, or the employee or agent responsible for delivering the 
message solely to the intended recipient, you are hereby notified that any 
dissemination, distribution or copying of this communication is strictly 
prohibited. If you have received this communication in error, please notify 
sender immediately by telephone or return email. Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jemalloc / Bluestore

2018-07-05 Thread Mark Nelson

Hi Uwe,


As luck would have it we were just looking at memory allocators again 
and ran some quick RBD and RGW tests that stress memory allocation:



https://drive.google.com/uc?export=download&id=1VlWvEDSzaG7fE4tnYfxYtzeJ8mwx4DFg


The gist of it is that tcmalloc looks like it's doing pretty well 
relative to the version of jemalloc and libc malloc tested (The jemalloc 
version here is pretty old though).  You are also correct that there 
have been reports of crashes with jemalloc, potentially related to 
rocksdb.  Right now it looks like our decision to stick with tcmalloc is 
still valid.  I wouldn't suggest switching unless you can find evidence 
that tcmalloc is behaving worse than the others (and please let me know 
if you do!).


Thanks,

Mark


On 07/05/2018 08:08 AM, Uwe Sauter wrote:

Hi all,

is using jemalloc still recommended for Ceph?

There are multiple sites (e.g. 
https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/) from 
2015 where jemalloc
is praised for higher performance but I found a bug report that Bluestore 
crashes when used with jemalloc.

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jemalloc / Bluestore

2018-07-05 Thread Uwe Sauter

Ah, thanks…

I'm currently trying to diagnose a performace regression that occurs with the Ubuntu 4.15 kernel (on a Proxmox system) 
and thought that jemalloc, given the old reports, could help with that. But then I ran into that bug report.


I'll take from your info that I'm gonna stick to tcmalloc. You know, so much to 
test and benchmark, so little time…


Regards,

Uwe

Am 05.07.2018 um 19:08 schrieb Mark Nelson:

Hi Uwe,


As luck would have it we were just looking at memory allocators again and ran some quick RBD and RGW tests that stress 
memory allocation:



https://drive.google.com/uc?export=download&id=1VlWvEDSzaG7fE4tnYfxYtzeJ8mwx4DFg


The gist of it is that tcmalloc looks like it's doing pretty well relative to the version of jemalloc and libc malloc 
tested (The jemalloc version here is pretty old though).  You are also correct that there have been reports of crashes 
with jemalloc, potentially related to rocksdb.  Right now it looks like our decision to stick with tcmalloc is still 
valid.  I wouldn't suggest switching unless you can find evidence that tcmalloc is behaving worse than the others (and 
please let me know if you do!).


Thanks,

Mark


On 07/05/2018 08:08 AM, Uwe Sauter wrote:

Hi all,

is using jemalloc still recommended for Ceph?

There are multiple sites (e.g. https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/) from 2015 
where jemalloc

is praised for higher performance but I found a bug report that Bluestore 
crashes when used with jemalloc.

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool has many more objects per pg than average

2018-07-05 Thread Stefan Kooman
Quoting Brett Chancellor (bchancel...@salesforce.com):
> The error will go away once you start storing data in the other pools. Or,
> you could simply silence the message with mon_pg_warn_max_object_skew = 0

Ran into this issue myself (again). Note to self: You need to restart the 
_active_ MGR
to get this option active. That's not written anywhere, yet. I'll see if
I can make a doc PR to fix that.
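
For what it's worth, a small helper sketch to find which mgr is the active one
before restarting it. This assumes the ceph CLI is in PATH and that "ceph mgr
dump" exposes an "active_name" field, as it does on the releases I've looked at:

import json
import subprocess

# "ceph mgr dump" prints the MgrMap; -f json forces JSON output
mgrmap = json.loads(
    subprocess.check_output(["ceph", "mgr", "dump", "-f", "json"]).decode())
print("active mgr:", mgrmap.get("active_name"))
# then restart ceph-mgr@<active_name> on the host it runs on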

Gr. Stefan

P.s. this option really does more harm than good IMHO. Mixed clusters
with cephfs / RGW mixed with RBD pools, and or newly created pools,
will run into this issue sooner or later.


-- 
| BIT BV  http://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance tuning for SAN SSD config

2018-07-05 Thread Steffen Winther Sørensen


> On 5 Jul 2018, at 16.51, Matthew Stroud  wrote:
> 
> Bump. I’m hoping I can get people more knowledgeable than me to take a look.
> We back some of our ceph clusters with SAN SSD disk, particularly VSP G/F and 
> Purestorage. I’m curious what are some settings we should look into modifying 
> to take advantage of our SAN arrays. We
I trust that you already looked into tuning the scsi layer through a proper
tuned profile, maybe enterprise (nobarrier, io scheduler none/deadline
etc.), to push your array the most.

> had to manually set the class for the luns to SSD class which was a big 
> improvement. However we still see situations where we get slow requests and 
> the underlying disks and network are underutilized.
>  
> More info about our setup. We are running centos 7 with Luminous as our ceph 
> release. We have 4 osd nodes that have 5x2TB disks each and they are setup as 
> bluestore. Our ceph.conf is attached with some information removed for 
> security reasons.
>  
> Thanks ahead of time.
>  
> Thanks,
> Matthew Stroud
>  
> 
> 
> CONFIDENTIALITY NOTICE: This message is intended only for the use and review 
> of the individual or entity to which it is addressed and may contain 
> information that is privileged and confidential. If the reader of this 
> message is not the intended recipient, or the employee or agent responsible 
> for delivering the message solely to the intended recipient, you are hereby 
> notified that any dissemination, distribution or copying of this 
> communication is strictly prohibited. If you have received this communication 
> in error, please notify sender immediately by telephone or return email. 
> Thank you.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com