Re: [ceph-users] mds failing to start 14.2.2

2019-10-11 Thread Yan, Zheng
On Sat, Oct 12, 2019 at 1:10 AM Kenneth Waegeman 
wrote:

> Hi all,
>
> After solving some pg inconsistency problems, my fs is still in
> > trouble. My MDSs are crashing with this error:
>
>
> > -5> 2019-10-11 19:02:55.375 7f2d39f10700  1 mds.1.564276 rejoin_start
> > -4> 2019-10-11 19:02:55.385 7f2d3d717700  5 mds.beacon.mds01
> > received beacon reply up:rejoin seq 5 rtt 1.01
> > -3> 2019-10-11 19:02:55.495 7f2d39f10700  1 mds.1.564276
> > rejoin_joint_start
> > -2> 2019-10-11 19:02:55.505 7f2d39f10700  5 mds.mds01
> > handle_mds_map old map epoch 564279 <= 564279, discarding
> > -1> 2019-10-11 19:02:55.695 7f2d33f04700 -1
> >
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/mdstypes.h: In function 'static void
> > dentry_key_t::decode_helper(std::string_view, std::string&,
> > snapid_t&)' thread 7f2d33f04700 time 2019-10-11 19:02:55.703343
> >
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/mdstypes.h:
>
> > 1229: FAILED ceph_assert(i != string::npos
> > )
> >
> >  ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
> > nautilus (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x14a) [0x7f2d43393046]
> >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char
> > const*, char const*, ...)+0) [0x7f2d43393214]
> >  3: (CDir::_omap_fetched(ceph::buffer::v14_2_0::list&,
> > std::map > std::less, std::allocator > ceph::buffer::v14_2_0::list> > >&, bool, int)+0xa68) [0x556a17ec
> > baa8]
> >  4: (C_IO_Dir_OMAP_Fetched::finish(int)+0x54) [0x556a17ee0034]
> >  5: (MDSContext::complete(int)+0x70) [0x556a17f5e710]
> >  6: (MDSIOContextBase::complete(int)+0x16b) [0x556a17f5e9ab]
> >  7: (Finisher::finisher_thread_entry()+0x156) [0x7f2d433d8386]
> >  8: (()+0x7dd5) [0x7f2d41262dd5]
> >  9: (clone()+0x6d) [0x7f2d3ff1302d]
> >
> >  0> 2019-10-11 19:02:55.695 7f2d33f04700 -1 *** Caught signal
> > (Aborted) **
> >  in thread 7f2d33f04700 thread_name:fn_anonymous
> >
> >  ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
> > nautilus (stable)
> >  1: (()+0xf5d0) [0x7f2d4126a5d0]
> >  2: (gsignal()+0x37) [0x7f2d3fe4b2c7]
> >  3: (abort()+0x148) [0x7f2d3fe4c9b8]
> >  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x199) [0x7f2d43393095]
> >  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char
> > const*, char const*, ...)+0) [0x7f2d43393214]
> >  6: (CDir::_omap_fetched(ceph::buffer::v14_2_0::list&,
> > std::map > std::less, std::allocator > ceph::buffer::v14_2_0::list> > >&, bool, int)+0xa68) [0x556a17ec
> > baa8]
> >  7: (C_IO_Dir_OMAP_Fetched::finish(int)+0x54) [0x556a17ee0034]
> >  8: (MDSContext::complete(int)+0x70) [0x556a17f5e710]
> >  9: (MDSIOContextBase::complete(int)+0x16b) [0x556a17f5e9ab]
> >  10: (Finisher::finisher_thread_entry()+0x156) [0x7f2d433d8386]
> >  11: (()+0x7dd5) [0x7f2d41262dd5]
> >  12: (clone()+0x6d) [0x7f2d3ff1302d]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> >
> > [root@mds02 ~]# ceph -s
> >   cluster:
> > id: 92bfcf0a-1d39-43b3-b60f-44f01b630e47
> > health: HEALTH_WARN
> > 1 filesystem is degraded
> > insufficient standby MDS daemons available
> > 1 MDSs behind on trimming
> > 1 large omap objects
> >
> >   services:
> > mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
> > mgr: mds02(active, since 3w), standbys: mds01, mds03
> > mds: ceph_fs:2/2 {0=mds02=up:rejoin,1=mds01=up:rejoin(laggy or
> > crashed)}
> > osd: 535 osds: 533 up, 529 in
> >
> >   data:
> > pools:   3 pools, 3328 pgs
> > objects: 376.32M objects, 673 TiB
> > usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
> > pgs: 3315 active+clean
> >  12   active+clean+scrubbing+deep
> >      1    active+clean+scrubbing
> >
> Does someone have an idea where to go from here? ☺
>
>
It looks like the omap for a dirfrag is corrupted. Please check the mds log (debug_mds
= 10) to find which omap is corrupted. Basically, all omap keys of a dirfrag
should be in the format <name>_<snapid>.
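
For example, to look at the keys of a suspect dirfrag object directly, a rough
sketch (the pool name 'metadata' and the dirfrag object name below are only
placeholders; dirfrag objects are named <inode-in-hex>.<frag>):

    ceph config set mds debug_mds 10
    rados -p metadata listomapkeys 1000000123a.00000000 | head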




> Thanks!
>
> K
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds servers in endless segfault loop

2019-10-11 Thread Pickett, Neale T
I have created an anonymized crash log at 
https://pastebin.ubuntu.com/p/YsVXQQTBCM/ in the hopes that it can help someone 
understand what's leading to our MDS outage.


Thanks in advance for any assistance.



From: Pickett, Neale T
Sent: Thursday, October 10, 2019 21:46
To: ceph-users@lists.ceph.com
Subject: mds servers in endless segfault loop


Hello, ceph-users.


Our mds servers keep segfaulting from a failed assertion, and for the first 
time I can't find anyone else who's posted about this problem. None of them are 
able to stay up, so our cephfs is down.


We recently had to truncate the journal after an upgrade to nautilus, and 
now we have lots of "dup inodes", "failed to open inode", and "badness: got (but i 
already had)" messages in the recent event dump, if that's relevant. I don't 
know which parts of that are going to be the most relevant, but here are the 
last ten entries:


  -10> 2019-10-11 03:30:35.258 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843c err -22/0
   -9> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843c err -22/0
   -8> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843d err -22/-22
   -7> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843e err -22/-22
   -6> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1843f err -22/-22
   -5> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1845a err -22/-22
   -4> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1845e err -22/-22
   -3> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a1846f err -22/-22
   -2> 2019-10-11 03:30:35.263 7fd080a69700  0 mds.0.cache  failed to open ino 0x1a18470 err -22/-22
   -1> 2019-10-11 03:30:35.273 7fd080a69700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/CInode.cc: In function 'CDir* CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7fd080a69700 time 2019-10-11 03:30:35.273849


I'm happy to provide any other information that would help diagnose the issue. 
I don't have any guesses about what else would be helpful, though.


Thanks in advance for any help!



Neale Pickett 
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool statistics via API

2019-10-11 Thread Sinan Polat
Hi Ernesto,

Thanks for the information! I didn’t know about the existence of the Dashboard 
REST API. I will check it out. Thanks again!

Sinan

> On 11 Oct 2019 at 21:06, Ernesto Puerta  wrote the 
> following:
> 
> Hi Sinan,
> 
> If it's in the Dashboard, it sure comes from the Dashboard REST API (which is 
> an API completely unrelated to the RESTful Module).
> 
> To check the Dashboard REST API, log in there and click on the top-right "?" 
> menu, and in the dropdown, click on "API". That will lead you to the 
> Swagger/OpenAPI spec of the Dashboard. You will likely want to explore the 
> "/pool" and "/block" endpoints. The API page will give you ready-to-use curl 
> commands (the only thing you'd need to renew, once expired, is the 
> authorization token).
> 
> Kind regards,
> 
> Ernesto Puerta
> He / Him / His
> Senior Software Engineer, Ceph
> Red Hat
> 
> 
> 
>> On Thu, Oct 10, 2019 at 2:16 PM Sinan Polat  wrote:
>> Hi,
>> 
>> Currently I am getting the pool statistics (especially USED/MAX AVAIL) via 
>> the command line:
>> ceph df -f json-pretty| jq '.pools[] | select(.name == "poolname") | 
>> .stats.max_avail'
>> ceph df -f json-pretty| jq '.pools[] | select(.name == "poolname") | 
>> .stats.bytes_used'
>> 
>> Command "ceph df" does not show the (total) size of the provisioned RBD 
>> images. It only shows the real usage.
>> 
>> I managed to get the total size of provisioned images using the Python rbd 
>> module https://docs.ceph.com/docs/master/rbd/api/librbdpy/
>> 
>> Using the same Python module I also would like to get the USED/MAX AVAIL per 
>> pool. That should be possible using rbd.RBD().pool_stats_get, but 
>> unfortunately my python-rbd version doesn't support that (running 12.2.8).
>> 
>> So I went ahead and enabled the dashboard to see if the data is present in 
>> the dashboard and it seems it is. Next step is to enable the restful module 
>> and access this information, right? But unfortunately the restful api 
>> doesn't provide this information.
>> 
>> My question is, how can I access the USED/MAX AVAIL information of a pool 
>> without using the ceph command line and without upgrading my python-rbd 
>> package?
>> 
>> Kind regards
>> Sinan Polat
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool statistics via API

2019-10-11 Thread Ernesto Puerta
Hi Sinan,

If it's in the Dashboard, it sure comes from the Dashboard REST API (which
is an API completely unrelated to the RESTful Module).

To check the Dashboard REST API, log in there and click on the top-right
"?" menu, and in the dropdown, click on "API". That will lead you to the
Swagger/OpenAPI spec of the Dashboard. You will likely want to explore the
"/pool" and "/block" endpoints. The API page will give you ready-to-use
curl commands (the only thing you'd need to renew, once expired, is the
authorization token).
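
For example, a rough sketch of that flow (host, port and credentials below are
placeholders; the exact endpoints and parameters are listed on that API page):

    curl -k -X POST -H 'Content-Type: application/json' \
         -d '{"username": "admin", "password": "secret"}' https://mgr-host:8443/api/auth
    curl -k -H 'Authorization: Bearer <token-from-the-previous-call>' \
         'https://mgr-host:8443/api/pool?stats=true'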

Kind regards,

Ernesto Puerta

He / Him / His

Senior Software Engineer, Ceph

Red Hat 



On Thu, Oct 10, 2019 at 2:16 PM Sinan Polat  wrote:

> Hi,
>
> Currently I am getting the pool statistics (especially USED/MAX AVAIL) via
> the command line:
> ceph df -f json-pretty| jq '.pools[] | select(.name == "poolname") |
> .stats.max_avail'
> ceph df -f json-pretty| jq '.pools[] | select(.name == "poolname") |
> .stats.bytes_used'
>
> Command "ceph df" does not show the (total) size of the provisioned RBD
> images. It only shows the real usage.
>
> I managed to get the total size of provisioned images using the Python rbd
> module https://docs.ceph.com/docs/master/rbd/api/librbdpy/
>
> Using the same Python module I also would like to get the USED/MAX AVAIL
> per pool. That should be possible using rbd.RBD().pool_stats_get, but
> unfortunately my python-rbd version doesn't support that (running 12.2.8).
>
> So I went ahead and enabled the dashboard to see if the data is present in
> the dashboard and it seems it is. Next step is to enable the restful module
> and access this information, right? But unfortunately the restful api
> doesn't provide this information.
>
> My question is, how can I access the USED/MAX AVAIL information of a pool
> without using the ceph command line and without upgrading my python-rbd
> package?
>
> Kind regards
> Sinan Polat 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds failing to start 14.2.2

2019-10-11 Thread Kenneth Waegeman

Hi all,

After solving some pg inconsistency problems, my fs is still in 
trouble. My MDSs are crashing with this error:




    -5> 2019-10-11 19:02:55.375 7f2d39f10700  1 mds.1.564276 rejoin_start
    -4> 2019-10-11 19:02:55.385 7f2d3d717700  5 mds.beacon.mds01 
received beacon reply up:rejoin seq 5 rtt 1.01
    -3> 2019-10-11 19:02:55.495 7f2d39f10700  1 mds.1.564276 
rejoin_joint_start
    -2> 2019-10-11 19:02:55.505 7f2d39f10700  5 mds.mds01 
handle_mds_map old map epoch 564279 <= 564279, discarding
    -1> 2019-10-11 19:02:55.695 7f2d33f04700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/mdstypes.h: In function 'static void 
dentry_key_t::decode_helper(std::string_view, std::string&, 
snapid_t&)' thread 7f2d33f04700 time 2019-10-11 19:02:55.703343
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/mdstypes.h: 
1229: FAILED ceph_assert(i != string::npos

)

 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14a) [0x7f2d43393046]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x7f2d43393214]
 3: (CDir::_omap_fetched(ceph::buffer::v14_2_0::list&, 
std::mapstd::less, std::allocatorceph::buffer::v14_2_0::list> > >&, bool, int)+0xa68) [0x556a17ec

baa8]
 4: (C_IO_Dir_OMAP_Fetched::finish(int)+0x54) [0x556a17ee0034]
 5: (MDSContext::complete(int)+0x70) [0x556a17f5e710]
 6: (MDSIOContextBase::complete(int)+0x16b) [0x556a17f5e9ab]
 7: (Finisher::finisher_thread_entry()+0x156) [0x7f2d433d8386]
 8: (()+0x7dd5) [0x7f2d41262dd5]
 9: (clone()+0x6d) [0x7f2d3ff1302d]

 0> 2019-10-11 19:02:55.695 7f2d33f04700 -1 *** Caught signal 
(Aborted) **

 in thread 7f2d33f04700 thread_name:fn_anonymous

 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)

 1: (()+0xf5d0) [0x7f2d4126a5d0]
 2: (gsignal()+0x37) [0x7f2d3fe4b2c7]
 3: (abort()+0x148) [0x7f2d3fe4c9b8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x199) [0x7f2d43393095]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x7f2d43393214]
 6: (CDir::_omap_fetched(ceph::buffer::v14_2_0::list&, 
std::mapstd::less, std::allocatorceph::buffer::v14_2_0::list> > >&, bool, int)+0xa68) [0x556a17ec

baa8]
 7: (C_IO_Dir_OMAP_Fetched::finish(int)+0x54) [0x556a17ee0034]
 8: (MDSContext::complete(int)+0x70) [0x556a17f5e710]
 9: (MDSIOContextBase::complete(int)+0x16b) [0x556a17f5e9ab]
 10: (Finisher::finisher_thread_entry()+0x156) [0x7f2d433d8386]
 11: (()+0x7dd5) [0x7f2d41262dd5]
 12: (clone()+0x6d) [0x7f2d3ff1302d]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


[root@mds02 ~]# ceph -s
  cluster:
    id: 92bfcf0a-1d39-43b3-b60f-44f01b630e47
    health: HEALTH_WARN
    1 filesystem is degraded
    insufficient standby MDS daemons available
    1 MDSs behind on trimming
    1 large omap objects

  services:
    mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
    mgr: mds02(active, since 3w), standbys: mds01, mds03
    mds: ceph_fs:2/2 {0=mds02=up:rejoin,1=mds01=up:rejoin(laggy or 
crashed)}

    osd: 535 osds: 533 up, 529 in

  data:
    pools:   3 pools, 3328 pgs
    objects: 376.32M objects, 673 TiB
    usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
    pgs: 3315 active+clean
 12   active+clean+scrubbing+deep
 1    active+clean+scrubbing


Does someone have an idea where to go from here? ☺

Thanks!

K

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] lot of inconsistent+failed_repair - failed to pick suitable auth object (14.2.3)

2019-10-11 Thread Kenneth Waegeman



On 11/10/2019 01:21, Brad Hubbard wrote:

On Fri, Oct 11, 2019 at 12:27 AM Kenneth Waegeman
 wrote:

Hi Brad, all,

Pool 6 has min_size 2:

pool 6 'metadata' replicated size 3 min_size 2 crush_rule 1 object_hash
rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 172476
flags hashpspool stripe_width 0 application cephfs

This looked like something min_size 1 could cause, but I guess that's
not the cause here.


So the list of inconsistent objects is empty, which is weird, no?

Try scrubbing the pg just before running the command.


Ah that worked! I could then do the trick with the temporary_key to 
solve the inconsistent errors.
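
For anyone hitting the same thing, the rough sequence (following the workaround
from that tracker note; the object name below is a placeholder for the
inconsistent object reported by list-inconsistent-obj) was along these lines:

    ceph pg deep-scrub 6.327
    rados list-inconsistent-obj 6.327 --format=json-pretty
    rados -p metadata setomapval <inconsistent-object> temporary-key anything
    ceph pg deep-scrub 6.327
    ceph pg repair 6.327
    rados -p metadata rmomapkey <inconsistent-object> temporary-key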


Thanks!!

K




Thanks again!

K


On 10/10/2019 12:52, Brad Hubbard wrote:

Does pool 6 have min_size = 1 set?

https://tracker.ceph.com/issues/24994#note-5 would possibly be helpful
here, depending on what the output of the following command looks
like.

# rados list-inconsistent-obj [pgid] --format=json-pretty

On Thu, Oct 10, 2019 at 8:16 PM Kenneth Waegeman
 wrote:

Hi all,

After some node failures and rebalancing, we have a lot of PGs in an
inconsistent state. I tried to repair them, but it didn't work. This is also
in the logs:


2019-10-10 11:23:27.221 7ff54c9b0700  0 log_channel(cluster) log [DBG]
: 6.327 repair starts
2019-10-10 11:23:27.431 7ff5509b8700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 19 soid 6:e4c130fd:::20005f3b582.:head :
omap_digest 0x334f57be != omap_digest 0xa8c4ce76 from auth oi
6:e4c130fd:::20005f3b582.:head(203789'1033530 osd.3.0:342
dirty|omap|data_digest|omap_digest s 0 uv 1032164 dd  od
a8c4ce76 alloc_hint [0 0 0])
2019-10-10 11:23:27.431 7ff5509b8700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 72 soid 6:e4c130fd:::20005f3b582.:head :
omap_digest 0x334f57be != omap_digest 0xa8c4ce76 from auth oi
6:e4c130fd:::20005f3b582.:head(203789'1033530 osd.3.0:342
dirty|omap|data_digest|omap_digest s 0 uv 1032164 dd  od
a8c4ce76 alloc_hint [0 0 0])
2019-10-10 11:23:27.431 7ff5509b8700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 91 soid 6:e4c130fd:::20005f3b582.:head :
omap_digest 0x334f57be != omap_digest 0xa8c4ce76 from auth oi
6:e4c130fd:::20005f3b582.:head(203789'1033530 osd.3.0:342
dirty|omap|data_digest|omap_digest s 0 uv 1032164 dd  od
a8c4ce76 alloc_hint [0 0 0])
2019-10-10 11:23:27.431 7ff5509b8700 -1 log_channel(cluster) log [ERR]
: 6.327 soid 6:e4c130fd:::20005f3b582.:head : failed to pick
suitable auth object
2019-10-10 11:23:27.731 7ff54c9b0700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 19 soid 6:e4c2e57b:::20005f11daa.:head :
omap_digest 0x6aafaf97 != omap_digest 0x56dd55a2 from auth oi
6:e4c2e57b:::20005f11daa.:head(203789'1033711 osd.3.0:3666823
dirty|omap|data_digest|omap_digest s 0 uv 1032158 dd  od
56dd55a2 alloc_hint [0 0 0])
2019-10-10 11:23:27.731 7ff54c9b0700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 72 soid 6:e4c2e57b:::20005f11daa.:head :
omap_digest 0x6aafaf97 != omap_digest 0x56dd55a2 from auth oi
6:e4c2e57b:::20005f11daa.:head(203789'1033711 osd.3.0:3666823
dirty|omap|data_digest|omap_digest s 0 uv 1032158 dd  od
56dd55a2 alloc_hint [0 0 0])
2019-10-10 11:23:27.731 7ff54c9b0700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 91 soid 6:e4c2e57b:::20005f11daa.:head :
omap_digest 0x6aafaf97 != omap_digest 0x56dd55a2 from auth oi
6:e4c2e57b:::20005f11daa.:head(203789'1033711 osd.3.0:3666823
dirty|omap|data_digest|omap_digest s 0 uv 1032158 dd  od
56dd55a2 alloc_hint [0 0 0])
2019-10-10 11:23:27.731 7ff54c9b0700 -1 log_channel(cluster) log [ERR]
: 6.327 soid 6:e4c2e57b:::20005f11daa.:head : failed to pick
suitable auth object
2019-10-10 11:23:27.971 7ff54c9b0700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 19 soid 6:e4c40009:::20005f45f1b.:head :
omap_digest 0x7ccf5cc9 != omap_digest 0xe048d29 from auth oi
6:e4c40009:::20005f45f1b.:head(203789'1033837 osd.3.0:3666949
dirty|omap|data_digest|omap_digest s 0 uv 1032168 dd  od
e048d29 alloc_hint [0 0 0])
2019-10-10 11:23:27.971 7ff54c9b0700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 72 soid 6:e4c40009:::20005f45f1b.:head :
omap_digest 0x7ccf5cc9 != omap_digest 0xe048d29 from auth oi
6:e4c40009:::20005f45f1b.:head(203789'1033837 osd.3.0:3666949
dirty|omap|data_digest|omap_digest s 0 uv 1032168 dd  od
e048d29 alloc_hint [0 0 0])
2019-10-10 11:23:27.971 7ff54c9b0700 -1 log_channel(cluster) log [ERR]
: 6.327 shard 91 soid 6:e4c40009:::20005f45f1b.:head :
omap_digest 0x7ccf5cc9 != omap_digest 0xe048d29 from auth oi
6:e4c40009:::20005f45f1b.:head(203789'1033837 osd.3.0:3666949
dirty|omap|data_digest|omap_digest s 0 uv 1032168 dd  od
e048d29 alloc_hint [0 0 0])
2019-10-10 11:23:27.971 7ff54c9b0700 -1 log_channel(cluster) log [ERR]
: 6.327 soid 6:e4c40009:::20005f45f1b.:head : failed to pick
suitable auth object
2019-10-10 

Re: [ceph-users] ceph version 14.2.3-OSD fails

2019-10-11 Thread Stefan Priebe - Profihost AG

> On 11.10.2019 at 14:07, Igor Fedotov  wrote:
> 
> 
> Hi!
> 
> originally your issue looked like the ones from 
> https://tracker.ceph.com/issues/42223
> 
> And it looks like some key information for the FreeListManager is missing from RocksDB.
> 
> Once you have it present, we can check the content of RocksDB to prove 
> this hypothesis; please let me know if you want the guideline for that.
> 
> 
> 
> The last log is different, the key record is probably:
> 
> -2> 2019-10-09 23:03:47.011 7fb4295a7700 -1 rocksdb: submit_common error: 
> Corruption: block checksum mismatch: expected 2181709173, got 2130853119  in 
> db/204514.sst offset 0 size 61648 code = 2 Rocksdb transaction: 
> 
> which most probably denotes data corruption in DB. Unfortunately for now I 
> can't say if this is related to the original issue or not.
> 
> This time it is reminiscent of the issue shared on this mailing list a while ago by 
> Stefan Priebe. The post caption is "Bluestore OSDs keep crashing in 
> BlueStore.cc: 8808: FAILED assert(r == 0)"
> 
> So first of all I'd suggest distinguishing these issues for now and trying to 
> troubleshoot them separately.
> 
> 
> 
> As for the first case, I'm wondering if you have any OSDs still failing this 
> way, i.e. asserting in the allocator and showing 0 extents loaded: "_open_alloc 
> loaded 0 B in 0 extents"
> 
> If so, let's check the DB content first.
> 
> 
> 
> For the second case, I'm mostly wondering whether the issue is permanent for a 
> specific OSD or whether it disappears after an OSD/node restart, as it did in 
> Stefan's case.
> 

Just a note: it came back after some days. I'm still waiting for a ceph 
release which fixes the issue (v12.2.13)...


Stefan
> 
> Thanks,
> 
> Igor
> 
> 
> 
> On 10/10/2019 1:59 PM, cephuser2345 user wrote:
>> Hi Igor,
>> Since the last osd crash we have had some 4 more. We tried to check RocksDB with 
>> ceph-kvstore-tool:
>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 compact
>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 repair
>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 destructive-repair
>> 
>> Nothing helped; we had to redeploy the osd by removing it from the cluster 
>> and reinstalling it.
>> 
>> We have updated to ceph 14.2.4 two weeks or more ago, and osds are still failing 
>> in the same way.
>> I have managed to capture the first fault by using "ceph crash ls" and added 
>> the log+meta to this email.
>> Can these logs shed some light?
>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Sep 12, 2019 at 7:20 PM Igor Fedotov  wrote:
 Hi,
 
 this line:
 
 -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1 
 bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in 0 extents
 
 tells me that OSD is unable to load free list manager properly, i.e. list 
 of free/allocated blocks in unavailable.
 
 You might want to set 'debug bluestore = 10" and check additional log 
 output between 
 
 these two lines:
 
 -3> 2019-09-12 16:38:15.093 7fcd02fd1f80  1 
 bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc opening allocation 
 metadata
 -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1 
 bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in 0 extents
 
 And/or check RocksDB records prefixed with "b" prefix using 
 ceph-kvstore-tool.
 
 
 
 Igor
 
 
 
 P.S.
 
 Sorry, might be unresponsive for the next two week as I'm going on 
 vacation. 
 
 
 
 On 9/12/2019 7:04 PM, cephuser2345 user wrote:
> Hi
> We have updated the ceph version from 14.2.2 to 14.2.3.
> The osd tree shows:
> 
>   -21        76.68713     host osd048 
>  66   hdd  12.78119 osd.66  up  1.0 1.0 
>  67   hdd  12.78119 osd.67  up  1.0 1.0 
>  68   hdd  12.78119 osd.68  up  1.0 1.0 
>  69   hdd  12.78119 osd.69  up  1.0 1.0 
>  70   hdd  12.78119 osd.70  up  1.0 1.0 
>  71   hdd  12.78119 osd.71    down         0 1.0 
> 
> We can not get the osd up; we keep getting this error, and it is happening on a lot of osds.
> Can you please assist? :)  I have added a txt log.
> bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc opening allocation 
> metadata
> -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1 
> bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in 0 extents
> -1> 2019-09-12 16:38:15.101 7fcd02fd1f80 -1 
> /build/ceph-14.2.3/src/os/bluestore/fastbmap_allocator_impl.h: In 
> function 'void AllocatorLevel02::_mark_allocated(uint64_t, uint64_t) 
> [with L1 = AllocatorLevel01Loose; uint64_t = long unsigned int]' thread 
> 7fcd02fd1f80 time 2019-09-12 16:38:15.102539
> 
> 
> ___
> ceph-users mailing list
> 

Re: [ceph-users] ceph version 14.2.3-OSD fails

2019-10-11 Thread Igor Fedotov

Hi!

originally your issue looked like the ones from 
https://tracker.ceph.com/issues/42223


And it looks like some key information for the FreeListManager is missing 
from RocksDB.


Once you have it present, we can check the content of RocksDB to 
prove this hypothesis; please let me know if you want the guideline for 
that.



The last log is different, the key record is probably:

-2> 2019-10-09 23:03:47.011 7fb4295a7700 -1 rocksdb: submit_common 
error: Corruption: block checksum mismatch: expected 2181709173, got 
2130853119  in db/204514.sst offset 0 size 61648 code = 2 Rocksdb 
transaction:


which most probably denotes data corruption in DB. Unfortunately for now 
I can't say if this is related to the original issue or not.


This time it is reminiscent of the issue shared on this mailing list a while 
ago by Stefan Priebe. The post caption is "Bluestore OSDs keep crashing in 
BlueStore.cc: 8808: FAILED assert(r == 0)"


So first of all I'd suggest distinguishing these issues for now and trying 
to troubleshoot them separately.



As for the first case, I'm wondering if you have any OSDs still failing 
this way, i.e. asserting in the allocator and showing 0 extents loaded: 
"_open_alloc loaded 0 B in 0 extents"


If so, let's check the DB content first.
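
For example, something along these lines should list the freelist-related
records (the osd path is just a placeholder, and the OSD must be stopped while
you run it; the bitmap records live under the "b" prefix and, if I recall
correctly, the freelist metadata under "B"):

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 list B
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 list b | head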


For the second case, I'm mostly wondering whether the issue is permanent for 
a specific OSD or whether it disappears after an OSD/node restart, as it did 
in Stefan's case.



Thanks,

Igor


On 10/10/2019 1:59 PM, cephuser2345 user wrote:

Hi Igor,
Since the last osd crash we have had some 4 more. We tried to check RocksDB 
with ceph-kvstore-tool:

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 compact
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 repair
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-71 destructive-repair


Nothing helped; we had to redeploy the osd by removing it from the 
cluster and reinstalling it.


We have updated to ceph 14.2.4 two weeks or more ago, and osds are still 
failing in the same way.
I have managed to capture the first fault by using "ceph crash ls" and 
added the log+meta to this email.

Can these logs shed some light?










On Thu, Sep 12, 2019 at 7:20 PM Igor Fedotov <ifedo...@suse.de> wrote:

Hi,

this line:

    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in
0 extents

tells me that OSD is unable to load free list manager
properly, i.e. list of free/allocated blocks in unavailable.

You might want to set 'debug bluestore = 10" and check
additional log output between

these two lines:

    -3> 2019-09-12 16:38:15.093 7fcd02fd1f80  1
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc opening
allocation metadata
    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B in
0 extents

And/or check RocksDB records prefixed with "b" prefix using
ceph-kvstore-tool.


Igor


P.S.

Sorry, might be unresponsive for the next two week as I'm
going on vacation.


On 9/12/2019 7:04 PM, cephuser2345 user wrote:

Hi
We have updated the ceph version from 14.2.2 to 14.2.3.
The osd tree shows:

  -21        76.68713     host osd048
 66   hdd  12.78119         osd.66      up  1.0 1.0
 67   hdd  12.78119         osd.67      up  1.0 1.0
 68   hdd  12.78119         osd.68      up  1.0 1.0
 69   hdd  12.78119         osd.69      up  1.0 1.0
 70   hdd  12.78119         osd.70      up  1.0 1.0
 71   hdd  12.78119         osd.71    down  0 1.0

We can not get the osd up; we keep getting this error, and it is
happening on a lot of osds.
Can you please assist? :)  I have added a txt log.
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc opening
allocation metadata
    -2> 2019-09-12 16:38:15.101 7fcd02fd1f80  1
bluestore(/var/lib/ceph/osd/ceph-71) _open_alloc loaded 0 B
in 0 extents
    -1> 2019-09-12 16:38:15.101 7fcd02fd1f80 -1
/build/ceph-14.2.3/src/os/bluestore/fastbmap_allocator_impl.h:
In function 'void
AllocatorLevel02::_mark_allocated(uint64_t, uint64_t)
[with L1 = AllocatorLevel01Loose; uint64_t = long unsigned
int]' thread 7fcd02fd1f80 time 2019-09-12 16:38:15.102539

___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw: multisite support

2019-10-11 Thread M Ranga Swami Reddy
I have set up the realm, zonegroup and master zone. Now I am pulling the
realm details from the master to the secondary... it failed with "request failed:
(22) Invalid argument":
==
radosgw-admin realm pull --url={url}, --access-key={key}  --secret={key}
request failed: (22) Invalid argument
==
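
For reference, the documented pull sequence on the secondary looks roughly like
this (the URL and the master zone system user's keys below are placeholders,
and there is no comma between the arguments):

radosgw-admin realm pull --url=http://rgw-master.example.com:8080 --access-key=<system-access-key> --secret=<system-secret-key>
radosgw-admin period pull --url=http://rgw-master.example.com:8080 --access-key=<system-access-key> --secret=<system-secret-key>
radosgw-admin realm default --rgw-realm=<realm-name>
radosgw-admin period update --commit    # after creating the secondary zone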

On Mon, Oct 7, 2019 at 12:46 PM M Ranga Swami Reddy 
wrote:

> Thank you... Let me confirm the same and update here.
>
> On Sat, Oct 5, 2019 at 12:27 AM  wrote:
>
>> Swami;
>>
>> For 12.2.11 (Luminous), the previously linked document would be:
>>
>> https://docs.ceph.com/docs/luminous/radosgw/multisite/#migrating-a-single-site-system-to-multi-site
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director – Information Technology
>> Perform Air International Inc.
>> dhils...@performair.com
>> www.PerformAir.com
>>
>>
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Joachim Kraftmayer
>> Sent: Friday, October 04, 2019 7:50 AM
>> To: M Ranga Swami Reddy
>> Cc: ceph-users; d...@ceph.io
>> Subject: Re: [ceph-users] rgw: multisite support
>>
>> Maybe this will help you:
>>
>> https://docs.ceph.com/docs/master/radosgw/multisite/#migrating-a-single-site-system-to-multi-site
>>
>> ___
>>
>> Clyso GmbH
>>
>>
>> On 03.10.2019 at 13:32, M Ranga Swami Reddy wrote:
>> Thank you. Do we have a quick document to do this migration?
>>
>> Thanks
>> Swami
>>
>> On Thu, Oct 3, 2019 at 4:38 PM Paul Emmerich 
>> wrote:
>> On Thu, Oct 3, 2019 at 12:03 PM M Ranga Swami Reddy
>>  wrote:
>> >
>> > Below url says: "Switching from a standalone deployment to a multi-site
>> replicated deployment is not supported.
>> >
>> https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-rgw-multisite.html
>>
>> This is wrong; it might be a weird openstack-specific restriction.
>>
>> Migrating single-site to multi-site is trivial, you just add the second
>> site.
>>
>>
>> Paul
>>
>> >
>> > Please advise.
>> >
>> >
>> > On Thu, Oct 3, 2019 at 3:28 PM M Ranga Swami Reddy <
>> swamire...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >> I am using 2 ceph clusters in different DCs (about 500 km apart) with ceph
>> >> version 12.2.11.
>> >> Now I want to set up rgw multisite using the above 2 ceph clusters.
>> >>
>> >> Is it possible? If yes, please share a good document for doing this.
>> >>
>> >> Thanks
>> >> Swami
>> >
>> > ___
>> > Dev mailing list -- d...@ceph.io
>> > To unsubscribe send an email to dev-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com