[ceph-users] Ceph + SAMBA (vfs_ceph)

2019-08-27 Thread Salsa
I'm running a Ceph installation in a lab to evaluate it for production. I have 
a cluster running, but I need to mount it on different Windows servers and 
desktops. I created an NFS share and was able to mount it on my Linux desktop, 
but not on a Windows 10 desktop. Since it seems that Windows Server 2016 is 
required to mount the NFS share, I gave up on that route and decided to try Samba.

I compiled a version of Samba that includes the vfs_ceph module, but I can't set 
it up correctly. It seems I'm missing some user configuration, as I've hit this 
error:

"
~$ smbclient -U samba.gw //10.17.6.68/cephfs_a
WARNING: The "syslog" option is deprecated
Enter WORKGROUP\samba.gw's password:
session setup failed: NT_STATUS_LOGON_FAILURE
"
Does anyone know of any good setup tutorial to follow?

This is my smb config so far:

# Global parameters
[global]
load printers = No
netbios name = SAMBA-CEPH
printcap name = cups
security = USER
workgroup = CEPH
smbd: backgroundqueue = no
idmap config * : backend = tdb
cups options = raw
valid users = samba

[cephfs]
create mask = 0777
directory mask = 0777
guest ok = Yes
guest only = Yes
kernel share modes = No
path = /
read only = No
vfs objects = ceph
ceph: user_id = samba
ceph:config_file = /etc/ceph/ceph.conf
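
For reference, my current understanding of the user setup that may be missing
(the user name, pool name and capabilities below are guesses on my part, not a
verified recipe) - as far as I can tell, NT_STATUS_LOGON_FAILURE just means the
account isn't known to Samba's password database:

# create a matching Unix account and add it to Samba's password database
useradd -M -s /sbin/nologin samba
smbpasswd -a samba

# create the CephFS client key referenced by "ceph: user_id = samba"
ceph auth get-or-create client.samba \
    mon 'allow r' mds 'allow rw' osd 'allow rw pool=cephfs_data' \
    -o /etc/ceph/ceph.client.samba.keyring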

Thanks

--
Salsa

Sent with [ProtonMail](https://protonmail.com) Secure Email.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Scientific Computing User Group

2019-08-27 Thread Kevin Hrpcek
The first Ceph + HTC/HPC/science virtual user group meeting is tomorrow, 
Wednesday August 28th, at 10:30am US Eastern / 4:30pm EU Central time. The 
duration will be kept to <= 1 hour.

I'd like this to be conducted as a user group and not just one person 
talking/presenting. For this first meeting I'd like to get input from everyone 
on the call about what field they are in and how Ceph is used as a solution in 
their environment. We'll see where it goes from there. Use the pad link 
below to get to a URL for live meeting notes.

Meeting connection details from the ceph community calendar:

Description: Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

Meetings will be recorded and posted to the Ceph Youtube channel.

To join the meeting on a computer or mobile phone: 
https://bluejeans.com/908675367?src=calendarLink

To join from a Red Hat Deskphone or Softphone, dial: 84336.

Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367

Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
    See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #

Want to test your video connection? 
https://bluejeans.com/111

Kevin


On 8/2/19 12:08 PM, Mike Perez wrote:
We have scheduled the next meeting on the community calendar for August 28 at 
14:30 UTC. Each meeting will then take place on the last Wednesday of each 
month.

Here's the pad to collect agenda/notes: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

--
Mike Perez (thingee)


On Tue, Jul 23, 2019 at 10:40 AM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:
Update

We're going to hold off until August for this so we can promote it on the Ceph 
Twitter account with more notice. Sorry for the inconvenience if you were planning on 
the meeting tomorrow. Keep a watch on the list, Twitter, or the Ceph calendar for 
updates.

Kevin


On 7/5/19 11:15 PM, Kevin Hrpcek wrote:
We've had some positive feedback and will be moving forward with this user 
group. The first virtual user group meeting is planned for July 24th at 4:30pm 
central European time/10:30am American eastern time. We will keep it to an hour 
in length. The plan is to use the ceph bluejeans video conferencing and it will 
be put on the ceph community calendar. I will send out links when it is closer 
to the 24th.

The goal of this user group is to promote conversations and the sharing of ideas for 
how Ceph is used in the scientific/HPC/HTC communities. Please be willing 
to discuss your use cases, cluster configs, problems you've had, shortcomings 
in Ceph, etc. Not everyone pays attention to the Ceph lists, so feel free to 
share the meeting information with others you know who may be interested in 
joining in.

Contact me if you have questions, comments, suggestions, or want to volunteer a 
topic for meetings. I will be brainstorming some conversation starters but it 
would also be interesting to have people give a deep dive into their use of 
ceph and what they have built around it to support the science being done at 
their facility.

Kevin



On 6/17/19 10:43 AM, Kevin Hrpcek wrote:
Hey all,

At Cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation, focused on Ceph in scientific computing and HTC/HPC 
environments. We are considering putting together a monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
Ceph community. At Cephalocon we mostly had teams present from the EU, so I'm 
interested in hearing how much community interest there is in a 
Ceph + science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone, but initially we are considering something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
Just to further piggyback,

Probably the hardest the mgr seems to get pushed is when the balancer is 
engaged. When trying to eval a pool or the cluster, it takes upwards of 30-120 
seconds for it to score it, and then another 30-120 seconds to execute the 
plan, and it never seems to engage automatically.
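
For context, the balancer commands involved are roughly the following (the plan
name is just an example):

ceph balancer eval                 # score the current cluster
ceph balancer optimize myplan      # build a plan
ceph balancer eval myplan          # score the proposed plan
ceph balancer execute myplan       # apply it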

> $ time ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "upmap"
> }
> 
> real    0m36.490s
> user    0m0.259s
> sys     0m0.044s


I'm going to disable mine as well, and see if I can stop waking up to 'No 
Active MGR.'


You can see when I lose mgrs because the RBD image stats go to 0 until I catch it.

Thanks,

Reed

> On Aug 27, 2019, at 11:24 AM, Jake Grimmett  wrote:
> 
> Hi Reed, Lenz, John
> 
> I've just tried disabling the balancer; so far ceph-mgr is keeping its
> CPU mostly under 20%, even with both the iostat and dashboard back on.
> 
> # ceph balancer off
> 
> was
> [root@ceph-s1 backup]# ceph balancer status
> {
>"active": true,
>"plans": [],
>"mode": "upmap"
> }
> 
> now
> [root@ceph-s1 backup]# ceph balancer status
> {
>"active": false,
>"plans": [],
>"mode": "upmap"
> }
> 
> We are using 8:2 erasure encoding across 324 12TB OSDs, plus 4 NVMe OSDs
> for a replicated cephfs metadata pool.
> 
> let me know if the balancer is your problem too...
> 
> best,
> 
> Jake
> 
> On 8/27/19 3:57 PM, Jake Grimmett wrote:
>> Yes, the problem still occurs with the dashboard disabled...
>> 
>> Possibly relevant, when both the dashboard and iostat plugins are
>> disabled, I occasionally see ceph-mgr rise to 100% CPU.
>> 
>> as suggested by John Hearns, the output of  gstack ceph-mgr when at 100%
>> is here:
>> 
>> http://p.ip.fi/52sV
>> 
>> many thanks
>> 
>> Jake
>> 
>> On 8/27/19 3:09 PM, Reed Dier wrote:
>>> I'm currently seeing this with the dashboard disabled.
>>> 
>>> My instability decreases, but isn't wholly cured, by disabling
>>> prometheus and rbd_support, which I use in tandem, as the only thing I'm
>>> using the prom-exporter for is the per-rbd metrics.
>>> 
 ceph mgr module ls
 {
 "enabled_modules": [
 "diskprediction_local",
 "influx",
 "iostat",
 "prometheus",
 "rbd_support",
 "restful",
 "telemetry"
 ],
>>> 
>>> I'm on Ubuntu 18.04, so that doesn't corroborate with some possible OS
>>> correlation.
>>> 
>>> Thanks,
>>> 
>>> Reed
>>> 
 On Aug 27, 2019, at 8:37 AM, Lenz Grimmer >>> > wrote:
 
 Hi Jake,
 
 On 8/27/19 3:22 PM, Jake Grimmett wrote:
 
> That exactly matches what I'm seeing:
> 
> when iostat is working OK, I see ~5% CPU use by ceph-mgr
> and when iostat freezes, ceph-mgr CPU increases to 100%
 
 Does this also occur if the dashboard module is disabled? Just wondering
 if this is isolatable to the iostat module. Thanks!
 
 Lenz
 
 -- 
 SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
 GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>> 
> 
> 
> -- 
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Hi Reed, Lenz, John

I've just tried disabling the balancer; so far ceph-mgr is keeping its
CPU mostly under 20%, even with both the iostat and dashboard back on.

# ceph balancer off

was
[root@ceph-s1 backup]# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}

now
[root@ceph-s1 backup]# ceph balancer status
{
"active": false,
"plans": [],
"mode": "upmap"
}

We are using 8:2 erasure encoding across 324 12TB OSDs, plus 4 NVMe OSDs
for a replicated cephfs metadata pool.

let me know if the balancer is your problem too...

best,

Jake

On 8/27/19 3:57 PM, Jake Grimmett wrote:
> Yes, the problem still occurs with the dashboard disabled...
> 
> Possibly relevant, when both the dashboard and iostat plugins are
> disabled, I occasionally see ceph-mgr rise to 100% CPU.
> 
> as suggested by John Hearns, the output of  gstack ceph-mgr when at 100%
> is here:
> 
> http://p.ip.fi/52sV
> 
> many thanks
> 
> Jake
> 
> On 8/27/19 3:09 PM, Reed Dier wrote:
>> I'm currently seeing this with the dashboard disabled.
>>
>> My instability decreases, but isn't wholly cured, by disabling
>> prometheus and rbd_support, which I use in tandem, as the only thing I'm
>> using the prom-exporter for is the per-rbd metrics.
>>
>>> ceph mgr module ls
>>> {
>>>     "enabled_modules": [
>>>         "diskprediction_local",
>>>         "influx",
>>>         "iostat",
>>>         "prometheus",
>>>         "rbd_support",
>>>         "restful",
>>>         "telemetry"
>>>     ],
>>
>> I'm on Ubuntu 18.04, so that doesn't corroborate with some possible OS
>> correlation.
>>
>> Thanks,
>>
>> Reed
>>
>>> On Aug 27, 2019, at 8:37 AM, Lenz Grimmer wrote:
>>>
>>> Hi Jake,
>>>
>>> On 8/27/19 3:22 PM, Jake Grimmett wrote:
>>>
 That exactly matches what I'm seeing:

 when iostat is working OK, I see ~5% CPU use by ceph-mgr
 and when iostat freezes, ceph-mgr CPU increases to 100%
>>>
>>> Does this also occur if the dashboard module is disabled? Just wondering
>>> if this is isolatable to the iostat module. Thanks!
>>>
>>> Lenz
>>>
>>> -- 
>>> SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
>>> GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 


-- 
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recovery from "FAILED assert(omap_num_objs <= MAX_OBJECTS)"

2019-08-27 Thread Zoë O'Connell
We have run in to what looks like bug 36094 
(https://tracker.ceph.com/issues/36094) on our 13.2.6 cluster and 
unfortunately now one of our ranks (Rank 1) won't start - it comes up 
for a few seconds before the assigned MDS crashes again with the below 
log entries. It would appear that OpenFileTable has somehow become 
corrupted, but it's not clear from any of the Ceph tool documentation if 
there is any way of clearing this.


Before we resort to deleting and recreating the cluster, are there any 
further recovery steps we can perform?


Thanks.

2019-08-27 16:10:50.775 7f2c94581700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc: 
In function 'void OpenFileTable::commit(MDSInternalContextBase*, 
uint64_t, int)' thread 7f2c94581700 time 2019-08-27 16:10:50.774858
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc: 
473: FAILED assert(omap_num_objs <= MAX_OBJECTS)


 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14b) [0x7f2ca064636b]

 2: (()+0x26e4f7) [0x7f2ca06464f7]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, 
int)+0x1b35) [0x557afbe49265]

 4: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
 5: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
 6: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
 7: (Context::complete(int)+0x9) [0x557afbbb0ef9]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
 9: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
 10: (()+0x7dd5) [0x7f2c9e284dd5]
 11: (clone()+0x6d) [0x7f2c9d36202d]

2019-08-27 16:10:50.777 7f2c94581700 -1 *** Caught signal (Aborted) **
 in thread 7f2c94581700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)

 1: (()+0xf5d0) [0x7f2c9e28c5d0]
 2: (gsignal()+0x37) [0x7f2c9d29a2c7]
 3: (abort()+0x148) [0x7f2c9d29b9b8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x248) [0x7f2ca0646468]

 5: (()+0x26e4f7) [0x7f2ca06464f7]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, 
int)+0x1b35) [0x557afbe49265]

 7: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
 8: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
 9: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
 10: (Context::complete(int)+0x9) [0x557afbbb0ef9]
 11: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
 12: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
 13: (()+0x7dd5) [0x7f2c9e284dd5]
 14: (clone()+0x6d) [0x7f2c9d36202d]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.
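
For reference, a rough sketch of one way to inspect the on-disk OpenFileTable
objects for rank 1 (the pool name and the mds1_openfiles.* object naming are
assumptions based on the default layout):

rados -p cephfs_metadata ls | grep mds1_openfiles
rados -p cephfs_metadata listomapkeys mds1_openfiles.0 | wc -l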



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MON DNS Lookup & Version 2 Protocol

2019-08-27 Thread Jason Dillaman
On Wed, Jul 17, 2019 at 3:07 PM  wrote:
>
> All;
>
> I'm trying to firm up my understanding of how Ceph works, and ease of 
> management tools and capabilities.
>
> I stumbled upon this: 
> http://docs.ceph.com/docs/nautilus/rados/configuration/mon-lookup-dns/
>
> It got me wondering; how do you convey protocol version 2 capabilities in 
> this format?
>
> The examples all list port 6789, which is the port for protocol version 1.  
> Would I add SRV records for port 3300?  How does the client distinguish v1 
> from v2 in this case?

If you specify the default v1 port it assumes the v1 protocol and if
you specify the default v2 port it assumes the v2 protocol. If you
don't specify a port, it will try both v1 and v2 at the default port
locations. Otherwise, it again tries both protocols against the
specified custom port. [1]
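
A minimal sketch of what that could look like in a zone file, assuming the
default mon_dns_srv_name of "ceph-mon" and a placeholder domain of example.com:

_ceph-mon._tcp.example.com. 3600 IN SRV 10 20 6789 mon1.example.com.   ; v1
_ceph-mon._tcp.example.com. 3600 IN SRV 10 20 3300 mon1.example.com.   ; v2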

> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International, Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[1] https://github.com/ceph/ceph/blob/master/src/mon/MonMap.cc#L398

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Yes, the problem still occurs with the dashboard disabled...

Possibly relevant, when both the dashboard and iostat plugins are
disabled, I occasionally see ceph-mgr rise to 100% CPU.

as suggested by John Hearns, the output of  gstack ceph-mgr when at 100%
is here:

http://p.ip.fi/52sV

many thanks

Jake

On 8/27/19 3:09 PM, Reed Dier wrote:
> I'm currently seeing this with the dashboard disabled.
> 
> My instability decreases, but isn't wholly cured, by disabling
> prometheus and rbd_support, which I use in tandem, as the only thing I'm
> using the prom-exporter for is the per-rbd metrics.
> 
>> ceph mgr module ls
>> {
>>     "enabled_modules": [
>>         "diskprediction_local",
>>         "influx",
>>         "iostat",
>>         "prometheus",
>>         "rbd_support",
>>         "restful",
>>         "telemetry"
>>     ],
> 
> I'm on Ubuntu 18.04, so that doesn't corroborate with some possible OS
> correlation.
> 
> Thanks,
> 
> Reed
> 
>> On Aug 27, 2019, at 8:37 AM, Lenz Grimmer wrote:
>>
>> Hi Jake,
>>
>> On 8/27/19 3:22 PM, Jake Grimmett wrote:
>>
>>> That exactly matches what I'm seeing:
>>>
>>> when iostat is working OK, I see ~5% CPU use by ceph-mgr
>>> and when iostat freezes, ceph-mgr CPU increases to 100%
>>
>> Does this also occur if the dashboard module is disabled? Just wondering
>> if this is isolatable to the iostat module. Thanks!
>>
>> Lenz
>>
>> -- 
>> SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
>> GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG

On 27.08.19 at 16:20, Igor Fedotov wrote:
> It sounds like OSD is "recovering" after checksum error.

Maybe - I have no idea how this works. systemd starts the OSD again after it
crashes, and then it runs for weeks or days again.

> I.e. just failed OSD shows no errors in fsck and is able to restart and
> process new write requests for long enough period (longer than just a
> couple of minutes). Are these statements true?
Yes, normally it runs for weeks - not sure if one crashes two times or
just once.


> If so I can suppose this
> is accidental/volatile issue rather than data-at-rest corruption.
> Something like data incorrectly read from disk.
> 
> Are you using standalone disk drive for DB/WAL or it's shared with main
> one?

Standalone disks.

> Just in case, as low-hanging fruit - I'd suggest checking with
> dmesg and smartctl for drive errors...

No, sorry, not that easy ;-) - that would also mean nearly 50 to 60 SSDs
and around 30 servers suddenly have HW errors.

> FYI: one more reference for the similar issue:
> https://tracker.ceph.com/issues/24968
> 
> Also I recall an issue with some kernels that caused occasional invalid
> data reads under high memory pressure/swapping:
> https://tracker.ceph.com/issues/22464

We have a current 4.19.x kernel and no memory limit. Available memory is pretty
constant at 32GB.

Greets,
Stefan

> 
> IMO memory usage worth checking as well...
> 
> 
> Igor
> 
> 
> On 8/27/2019 4:52 PM, Stefan Priebe - Profihost AG wrote:
>> see inline
>>
>> Am 27.08.19 um 15:43 schrieb Igor Fedotov:
>>> see inline
>>>
>>> On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:
 Hi Igor,

 Am 27.08.19 um 14:11 schrieb Igor Fedotov:
> Hi Stefan,
>
> this looks like a duplicate for
>
> https://tracker.ceph.com/issues/37282
>
> Actually the root cause selection might be quite wide.
>
>   From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.
>
> As far as I understand you have different OSDs which are failing,
> right?
 Yes i've seen this on around 50 different OSDs running different HW but
 all run ceph 12.2.12. I've not seen this with 12.2.10 which we were
 running before.

> Is the set of these broken OSDs limited somehow?
 No at least i'm not able to find


> Any specific subset which is failing or something? E.g. just N of them
> are failing from time to time.
 No seems totally random.

> Any similarities for broken OSDs (e.g. specific hardware)?
 All run intel xeon CPUs and all run linux ;-)

> Did you run fsck for any of broken OSDs? Any reports?
 Yes but no reports.
>>> Are you saying that fsck is fine for OSDs that showed this sort of
>>> errors?
>> Yes fsck does not show a single error - everything is fine.
>>
> Any other errors/crashes in logs before these sort of issues happens?
 No


> Just in case - what allocator are you using?
 tcmalloc
>>> I meant BlueStore allocator - is it stupid or bitmap?
>> ah the default one i think this is stupid.
>>
>> Greets,
>> Stefan
>>
 Greets,
 Stefan

> Thanks,
>
> Igor
>
>
>
> On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> since some month all our bluestore OSDs keep crashing from time to
>> time.
>> Currently about 5 OSDs per day.
>>
>> All of them show the following trace:
>> Trace:
>> 2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb:
>> submit_transaction
>> error: Corruption: block checksum mismatch code = 2 Rocksdb
>> transaction:
>> Put( Prefix = M key =
>> 0x09a5'.916366.74680351' Value size =
>> 184)
>> Put( Prefix = M key = 0x09a5'._fastinfo' Value size =
>> 186)
>> Put( Prefix = O key =
>> 0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'
>>
>>
>>
>> Value size = 530)
>> Put( Prefix = O key =
>> 0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'
>>
>>
>>
>> Value size = 510)
>> Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
>> 2019-07-24 08:36:49.012110 7fb19a711700 -1
>> /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
>> BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
>> 08:36:48.995415
>> /build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r
>> == 0)
>>
>> ceph version 12.2.12-7-g1321c5e91f
>> (1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
>>     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x102) [0x5653a010e222]
>>     2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
>>     3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
>>     4: (()+0x7494) 

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov

It sounds like OSD is "recovering" after checksum error.

I.e. the just-failed OSD shows no errors in fsck and is able to restart and 
process new write requests for a long enough period (longer than just a 
couple of minutes). Are these statements true? If so, I suppose this 
is an accidental/volatile issue rather than data-at-rest corruption - 
something like data being incorrectly read from disk.


Are you using a standalone disk drive for DB/WAL, or is it shared with the main 
one? Just in case, as low-hanging fruit, I'd suggest checking with 
dmesg and smartctl for drive errors...
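
Something along these lines should be enough as a first pass (the device name
is just an example):

dmesg -T | grep -iE 'i/o error|reset|fail' | tail
smartctl -a /dev/sda | grep -iE 'error|pending|uncorrect'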


FYI: one more reference for a similar issue: 
https://tracker.ceph.com/issues/24968

It was a HW issue that time...


Also I recall an issue with some kernels that caused occasional invalid 
data reads under high memory pressure/swapping: 
https://tracker.ceph.com/issues/22464


IMO memory usage is worth checking as well...


Igor


On 8/27/2019 4:52 PM, Stefan Priebe - Profihost AG wrote:

see inline

Am 27.08.19 um 15:43 schrieb Igor Fedotov:

see inline

On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

Am 27.08.19 um 14:11 schrieb Igor Fedotov:

Hi Stefan,

this looks like a duplicate for

https://tracker.ceph.com/issues/37282

Actually the root cause selection might be quite wide.

  From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.

As far as I understand you have different OSDs which are failing, right?

Yes i've seen this on around 50 different OSDs running different HW but
all run ceph 12.2.12. I've not seen this with 12.2.10 which we were
running before.


Is the set of these broken OSDs limited somehow?

No at least i'm not able to find



Any specific subset which is failing or something? E.g. just N of them
are failing from time to time.

No seems totally random.


Any similarities for broken OSDs (e.g. specific hardware)?

All run intel xeon CPUs and all run linux ;-)


Did you run fsck for any of broken OSDs? Any reports?

Yes but no reports.

Are you saying that fsck is fine for OSDs that showed this sort of errors?

Yes fsck does not show a single error - everything is fine.


Any other errors/crashes in logs before these sort of issues happens?

No



Just in case - what allocator are you using?

tcmalloc

I meant BlueStore allocator - is it stupid or bitmap?

ah the default one i think this is stupid.

Greets,
Stefan


Greets,
Stefan


Thanks,

Igor



On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:

Hello,

since some month all our bluestore OSDs keep crashing from time to
time.
Currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb
transaction:
Put( Prefix = M key =
0x09a5'.916366.74680351' Value size = 184)
Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'


Value size = 530)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'


Value size = 510)
Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

ceph version 12.2.12-7-g1321c5e91f
(1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5653a010e222]
    2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
    3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
    4: (()+0x7494) [0x7fb1ab2f6494]
    5: (clone()+0x3f) [0x7fb1aa37dacf]

I already opend up a tracker:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
I'm currently seeing this with the dashboard disabled.

My instability decreases, but isn't wholly cured, by disabling prometheus and 
rbd_support, which I use in tandem, as the only thing I'm using the 
prom-exporter for is the per-rbd metrics.

> ceph mgr module ls
> {
> "enabled_modules": [
> "diskprediction_local",
> "influx",
> "iostat",
> "prometheus",
> "rbd_support",
> "restful",
> "telemetry"
> ],

I'm on Ubuntu 18.04, so that doesn't corroborate a possible OS 
correlation.

Thanks,

Reed

> On Aug 27, 2019, at 8:37 AM, Lenz Grimmer  wrote:
> 
> Hi Jake,
> 
> On 8/27/19 3:22 PM, Jake Grimmett wrote:
> 
>> That exactly matches what I'm seeing:
>> 
>> when iostat is working OK, I see ~5% CPU use by ceph-mgr
>> and when iostat freezes, ceph-mgr CPU increases to 100%
> 
> Does this also occur if the dashboard module is disabled? Just wondering
> if this is isolatable to the iostat module. Thanks!
> 
> Lenz
> 
> -- 
> SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
> GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-27 Thread thoralf schulze
hi Zheng,

On 8/26/19 3:31 PM, Yan, Zheng wrote:

[…]
> change code to :
[…]

we can happily confirm that this resolves the issue.

thank you _very_ much & with kind regards,
t.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
see inline

On 27.08.19 at 15:43, Igor Fedotov wrote:
> see inline
> 
> On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Igor,
>>
>> Am 27.08.19 um 14:11 schrieb Igor Fedotov:
>>> Hi Stefan,
>>>
>>> this looks like a duplicate for
>>>
>>> https://tracker.ceph.com/issues/37282
>>>
>>> Actually the root cause selection might be quite wide.
>>>
>>>  From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.
>>>
>>> As far as I understand you have different OSDs which are failing, right?
>> Yes i've seen this on around 50 different OSDs running different HW but
>> all run ceph 12.2.12. I've not seen this with 12.2.10 which we were
>> running before.
>>
>>> Is the set of these broken OSDs limited somehow?
>> No at least i'm not able to find
>>
>>
>>> Any specific subset which is failing or something? E.g. just N of them
>>> are failing from time to time.
>> No seems totally random.
>>
>>> Any similarities for broken OSDs (e.g. specific hardware)?
>> All run intel xeon CPUs and all run linux ;-)
>>
>>> Did you run fsck for any of broken OSDs? Any reports?
>> Yes but no reports.
> Are you saying that fsck is fine for OSDs that showed this sort of errors?

Yes fsck does not show a single error - everything is fine.

>>> Any other errors/crashes in logs before these sort of issues happens?
>> No
>>
>>
>>> Just in case - what allocator are you using?
>> tcmalloc
> I meant BlueStore allocator - is it stupid or bitmap?

ah the default one i think this is stupid.
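
A quick way to confirm what an affected OSD is actually using (osd.0 is just an
example; run it on the OSD's host):

ceph daemon osd.0 config get bluestore_allocator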

Greets,
Stefan

>>
>> Greets,
>> Stefan
>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>> On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:
 Hello,

 since some month all our bluestore OSDs keep crashing from time to
 time.
 Currently about 5 OSDs per day.

 All of them show the following trace:
 Trace:
 2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
 error: Corruption: block checksum mismatch code = 2 Rocksdb
 transaction:
 Put( Prefix = M key =
 0x09a5'.916366.74680351' Value size = 184)
 Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
 Put( Prefix = O key =
 0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'


 Value size = 530)
 Put( Prefix = O key =
 0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'


 Value size = 510)
 Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
 2019-07-24 08:36:49.012110 7fb19a711700 -1
 /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
 BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
 08:36:48.995415
 /build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

 ceph version 12.2.12-7-g1321c5e91f
 (1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x102) [0x5653a010e222]
    2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
    3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
    4: (()+0x7494) [0x7fb1ab2f6494]
    5: (clone()+0x3f) [0x7fb1aa37dacf]

 I already opend up a tracker:
 https://tracker.ceph.com/issues/41367

 Can anybody help? Is this known?

 Greets,
 Stefan
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread John Hearns
Try running gstack on the ceph-mgr process when it is frozen.
This could be a name resolution problem, as you suspect. Maybe gstack will
show where the process is 'stuck', and this might be a call to your name
resolution service.
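
A minimal sketch, assuming a single ceph-mgr process on the host:

gstack $(pidof ceph-mgr) > /tmp/ceph-mgr.stack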

On Tue, 27 Aug 2019 at 14:25, Jake Grimmett  wrote:

> Whoops, I'm running Scientific Linux 7.6, going to upgrade to 7.7. soon...
>
> thanks
>
> Jake
>
>
> On 8/27/19 2:22 PM, Jake Grimmett wrote:
> > Hi Reed,
> >
> > That exactly matches what I'm seeing:
> >
> > when iostat is working OK, I see ~5% CPU use by ceph-mgr
> > and when iostat freezes, ceph-mgr CPU increases to 100%
> >
> > regarding OS, I'm using Scientific Linux 7.7
> > Kernel 3.10.0-957.21.3.el7.x86_64
> >
> > I'm not sure if the mgr initiates scrubbing, but if so, this could be
> > the cause of the "HEALTH_WARN 20 pgs not deep-scrubbed in time" that we
> see.
> >
> > Anyhow, many thanks for your input, please let me know if you have
> > further ideas :)
> >
> > best,
> >
> > Jake
> >
> > On 8/27/19 2:01 PM, Reed Dier wrote:
> >> Curious what dist you're running on, as I've been having similar issues
> with instability in the mgr as well, curious if any similar threads to pull
> at.
> >>
> >> While the iostat command is running, is the active mgr using 100% CPU
> in top?
> >>
> >> Reed
> >>
> >>> On Aug 27, 2019, at 6:41 AM, Jake Grimmett wrote:
> >>>
> >>> Dear All,
> >>>
> >>> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40
> nodes.
> >>>
> >>> Unfortunately "ceph iostat" spends most of it's time frozen, with
> >>> occasional periods of working normally for less than a minute, then
> >>> freeze again for a couple of minutes, then come back to life, and so so
> >>> on...
> >>>
> >>> No errors are seen on screen, unless I press CTRL+C when iostat is
> stalled:
> >>>
> >>> [root@ceph-s3 ~]# ceph iostat
> >>> ^CInterrupted
> >>> Traceback (most recent call last):
> >>>  File "/usr/bin/ceph", line 1263, in 
> >>>retval = main()
> >>>  File "/usr/bin/ceph", line 1194, in main
> >>>verbose)
> >>>  File "/usr/bin/ceph", line 619, in new_style_command
> >>>ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
> >>> sigdict, inbuf, verbose)
> >>>  File "/usr/bin/ceph", line 593, in do_command
> >>>return ret, '', ''
> >>> UnboundLocalError: local variable 'ret' referenced before assignment
> >>>
> >>> Observations:
> >>>
> >>> 1) This problem does not seem to be related to load on the cluster.
> >>>
> >>> 2) When iostat is stalled the dashboard is also non-responsive, if
> >>> iostat is working, the dashboard also works.
> >>>
> >>> Presumably the iostat and dashboard problems are due to the same
> >>> underlying fault? Perhaps a problem with the mgr?
> >>>
> >>>
> >>> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
> >>> shows:
> >>>
> >>> 2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
> >>> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
> >>> "prefix": "iostat", "poll": true, "target": ["mgr", ""],
> "print_header":
> >>> false}]: dispatch
> >>>
> >>> 4) When iostat isn't working, we see no obvious errors in the mgr log.
> >>>
> >>> 5) When the dashboard is not working, mgr log sometimes shows:
> >>>
> >>> 2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
> >>> [:::10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
> >>> /api/health/minimal
> >>> 2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status":
> "500
> >>> Internal Server Error", "version": "3.2.2", "detail": "The server
> >>> encountered an unexpected condition which prevented it from fulfilling
> >>> the request.", "traceback": "Traceback (most recent call last):\\n
> File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line
> 656,
> >>> in respond\\nresponse.body = self.handler()\\n  File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
> >>> 188, in __call__\\nself.body = self.oldhandler(*args, **kwargs)\\n
> >>> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
> >>> 221, in wrap\\nreturn self.newhandler(innerfunc, *args,
> **kwargs)\\n
> >>> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
> >>> 88, in dashboard_exception_handler\\nreturn handler(*args,
> >>> **kwargs)\\n  File
> >>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line
> 34,
> >>> in __call__\\nreturn self.callable(*self.args, **self.kwargs)\\n
> >>> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
> >>> 649, in inner\\nret = func(*args, **kwargs)\\n  File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
> >>> minimal\\nreturn self.health_minimal.all_health()\\n  File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
> >>> all_health\\nresult[\'pools\'] = self.pools()\\n  File
> >>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 

Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov

see inline

On 8/27/2019 4:41 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

Am 27.08.19 um 14:11 schrieb Igor Fedotov:

Hi Stefan,

this looks like a duplicate for

https://tracker.ceph.com/issues/37282

Actually the root cause selection might be quite wide.

 From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.

As far as I understand you have different OSDs which are failing, right?

Yes i've seen this on around 50 different OSDs running different HW but
all run ceph 12.2.12. I've not seen this with 12.2.10 which we were
running before.


Is the set of these broken OSDs limited somehow?

No at least i'm not able to find



Any specific subset which is failing or something? E.g. just N of them
are failing from time to time.

No seems totally random.


Any similarities for broken OSDs (e.g. specific hardware)?

All run intel xeon CPUs and all run linux ;-)


Did you run fsck for any of broken OSDs? Any reports?

Yes but no reports.

Are you saying that fsck is fine for OSDs that showed this sort of errors?




Any other errors/crashes in logs before these sort of issues happens?

No



Just in case - what allocator are you using?

tcmalloc

I meant BlueStore allocator - is it stupid or bitmap?


Greets,
Stefan


Thanks,

Igor



On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:

Hello,

since some month all our bluestore OSDs keep crashing from time to time.
Currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Put( Prefix = M key =
0x09a5'.916366.74680351' Value size = 184)
Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'

Value size = 530)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'

Value size = 510)
Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

ceph version 12.2.12-7-g1321c5e91f
(1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5653a010e222]
   2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
   3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
   4: (()+0x7494) [0x7fb1ab2f6494]
   5: (clone()+0x3f) [0x7fb1aa37dacf]

I already opend up a tracker:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
Hi Igor,

On 27.08.19 at 14:11, Igor Fedotov wrote:
> Hi Stefan,
> 
> this looks like a duplicate for
> 
> https://tracker.ceph.com/issues/37282
> 
> Actually the root cause selection might be quite wide.
> 
> From HW issues to broken logic in RocksDB/BlueStore/BlueFS etc.
> 
> As far as I understand you have different OSDs which are failing, right?

Yes, I've seen this on around 50 different OSDs running different HW, but
all run ceph 12.2.12. I've not seen this with 12.2.10, which we were
running before.

> Is the set of these broken OSDs limited somehow?
No, at least I'm not able to find one.


> Any specific subset which is failing or something? E.g. just N of them
> are failing from time to time.

No, it seems totally random.

> Any similarities for broken OSDs (e.g. specific hardware)?

All run intel xeon CPUs and all run linux ;-)

> Did you run fsck for any of broken OSDs? Any reports?

Yes but no reports.


> Any other errors/crashes in logs before these sort of issues happens?

No


> Just in case - what allocator are you using?

tcmalloc

Greets,
Stefan

> 
> Thanks,
> 
> Igor
> 
> 
> 
> On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:
>> Hello,
>>
>> since some month all our bluestore OSDs keep crashing from time to time.
>> Currently about 5 OSDs per day.
>>
>> All of them show the following trace:
>> Trace:
>> 2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
>> error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
>> Put( Prefix = M key =
>> 0x09a5'.916366.74680351' Value size = 184)
>> Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
>> Put( Prefix = O key =
>> 0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'
>>
>> Value size = 530)
>> Put( Prefix = O key =
>> 0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'
>>
>> Value size = 510)
>> Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
>> 2019-07-24 08:36:49.012110 7fb19a711700 -1
>> /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
>> BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
>> 08:36:48.995415
>> /build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)
>>
>> ceph version 12.2.12-7-g1321c5e91f
>> (1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x102) [0x5653a010e222]
>>   2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
>>   3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
>>   4: (()+0x7494) [0x7fb1ab2f6494]
>>   5: (clone()+0x3f) [0x7fb1aa37dacf]
>>
>> I already opend up a tracker:
>> https://tracker.ceph.com/issues/41367
>>
>> Can anybody help? Is this known?
>>
>> Greets,
>> Stefan
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Lenz Grimmer
Hi Jake,

On 8/27/19 3:22 PM, Jake Grimmett wrote:

> That exactly matches what I'm seeing:
> 
> when iostat is working OK, I see ~5% CPU use by ceph-mgr
> and when iostat freezes, ceph-mgr CPU increases to 100%

Does this also occur if the dashboard module is disabled? Just wondering
if this is isolatable to the iostat module. Thanks!
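
A minimal sketch of how to toggle the two modules for testing (they can be
re-enabled the same way with "enable"):

ceph mgr module disable dashboard
ceph mgr module disable iostat
ceph mgr module ls        # confirm what is still enabled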

Lenz

-- 
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Whoops - I'm running Scientific Linux 7.6, going to upgrade to 7.7 soon...

thanks

Jake


On 8/27/19 2:22 PM, Jake Grimmett wrote:
> Hi Reed,
> 
> That exactly matches what I'm seeing:
> 
> when iostat is working OK, I see ~5% CPU use by ceph-mgr
> and when iostat freezes, ceph-mgr CPU increases to 100%
> 
> regarding OS, I'm using Scientific Linux 7.7
> Kernel 3.10.0-957.21.3.el7.x86_64
> 
> I'm not sure if the mgr initiates scrubbing, but if so, this could be
> the cause of the "HEALTH_WARN 20 pgs not deep-scrubbed in time" that we see.
> 
> Anyhow, many thanks for your input, please let me know if you have
> further ideas :)
> 
> best,
> 
> Jake
> 
> On 8/27/19 2:01 PM, Reed Dier wrote:
>> Curious what dist you're running on, as I've been having similar issues with 
>> instability in the mgr as well, curious if any similar threads to pull at.
>>
>> While the iostat command is running, is the active mgr using 100% CPU in top?
>>
>> Reed
>>
>>> On Aug 27, 2019, at 6:41 AM, Jake Grimmett  wrote:
>>>
>>> Dear All,
>>>
>>> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes.
>>>
>>> Unfortunately "ceph iostat" spends most of it's time frozen, with
>>> occasional periods of working normally for less than a minute, then
>>> freeze again for a couple of minutes, then come back to life, and so so
>>> on...
>>>
>>> No errors are seen on screen, unless I press CTRL+C when iostat is stalled:
>>>
>>> [root@ceph-s3 ~]# ceph iostat
>>> ^CInterrupted
>>> Traceback (most recent call last):
>>>  File "/usr/bin/ceph", line 1263, in 
>>>retval = main()
>>>  File "/usr/bin/ceph", line 1194, in main
>>>verbose)
>>>  File "/usr/bin/ceph", line 619, in new_style_command
>>>ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
>>> sigdict, inbuf, verbose)
>>>  File "/usr/bin/ceph", line 593, in do_command
>>>return ret, '', ''
>>> UnboundLocalError: local variable 'ret' referenced before assignment
>>>
>>> Observations:
>>>
>>> 1) This problem does not seem to be related to load on the cluster.
>>>
>>> 2) When iostat is stalled the dashboard is also non-responsive, if
>>> iostat is working, the dashboard also works.
>>>
>>> Presumably the iostat and dashboard problems are due to the same
>>> underlying fault? Perhaps a problem with the mgr?
>>>
>>>
>>> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
>>> shows:
>>>
>>> 2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
>>> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
>>> "prefix": "iostat", "poll": true, "target": ["mgr", ""], "print_header":
>>> false}]: dispatch
>>>
>>> 4) When iostat isn't working, we see no obvious errors in the mgr log.
>>>
>>> 5) When the dashboard is not working, mgr log sometimes shows:
>>>
>>> 2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
>>> [:::10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
>>> /api/health/minimal
>>> 2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status": "500
>>> Internal Server Error", "version": "3.2.2", "detail": "The server
>>> encountered an unexpected condition which prevented it from fulfilling
>>> the request.", "traceback": "Traceback (most recent call last):\\n  File
>>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
>>> in respond\\nresponse.body = self.handler()\\n  File
>>> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
>>> 188, in __call__\\nself.body = self.oldhandler(*args, **kwargs)\\n
>>> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
>>> 221, in wrap\\nreturn self.newhandler(innerfunc, *args, **kwargs)\\n
>>> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
>>> 88, in dashboard_exception_handler\\nreturn handler(*args,
>>> **kwargs)\\n  File
>>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
>>> in __call__\\nreturn self.callable(*self.args, **self.kwargs)\\n
>>> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
>>> 649, in inner\\nret = func(*args, **kwargs)\\n  File
>>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
>>> minimal\\nreturn self.health_minimal.all_health()\\n  File
>>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
>>> all_health\\nresult[\'pools\'] = self.pools()\\n  File
>>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
>>> pools\\npools = CephService.get_pool_list_with_stats()\\n  File
>>> \\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
>>> in get_pool_list_with_stats\\n\'series\': [i for i in
>>> stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']
>>>
>>>
>>> 6) IPV6 is normally disabled on our machines at the kernel level, via
>>> grubby --update-kernel=ALL --args="ipv6.disable=1"
>>>
>>> This was done as 'disabling ipv6' interfered with the dashboard (giving
>>> 

Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Hi Reed,

That exactly matches what I'm seeing:

when iostat is working OK, I see ~5% CPU use by ceph-mgr
and when iostat freezes, ceph-mgr CPU increases to 100%

regarding OS, I'm using Scientific Linux 7.7
Kernel 3.10.0-957.21.3.el7.x86_64

I'm not sure if the mgr initiates scrubbing, but if so, this could be
the cause of the "HEALTH_WARN 20 pgs not deep-scrubbed in time" that we see.

Anyhow, many thanks for your input, please let me know if you have
further ideas :)

best,

Jake

On 8/27/19 2:01 PM, Reed Dier wrote:
> Curious what dist you're running on, as I've been having similar issues with 
> instability in the mgr as well, curious if any similar threads to pull at.
> 
> While the iostat command is running, is the active mgr using 100% CPU in top?
> 
> Reed
> 
>> On Aug 27, 2019, at 6:41 AM, Jake Grimmett  wrote:
>>
>> Dear All,
>>
>> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes.
>>
>> Unfortunately "ceph iostat" spends most of it's time frozen, with
>> occasional periods of working normally for less than a minute, then
>> freeze again for a couple of minutes, then come back to life, and so so
>> on...
>>
>> No errors are seen on screen, unless I press CTRL+C when iostat is stalled:
>>
>> [root@ceph-s3 ~]# ceph iostat
>> ^CInterrupted
>> Traceback (most recent call last):
>>  File "/usr/bin/ceph", line 1263, in 
>>retval = main()
>>  File "/usr/bin/ceph", line 1194, in main
>>verbose)
>>  File "/usr/bin/ceph", line 619, in new_style_command
>>ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
>> sigdict, inbuf, verbose)
>>  File "/usr/bin/ceph", line 593, in do_command
>>return ret, '', ''
>> UnboundLocalError: local variable 'ret' referenced before assignment
>>
>> Observations:
>>
>> 1) This problem does not seem to be related to load on the cluster.
>>
>> 2) When iostat is stalled the dashboard is also non-responsive, if
>> iostat is working, the dashboard also works.
>>
>> Presumably the iostat and dashboard problems are due to the same
>> underlying fault? Perhaps a problem with the mgr?
>>
>>
>> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
>> shows:
>>
>> 2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
>> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
>> "prefix": "iostat", "poll": true, "target": ["mgr", ""], "print_header":
>> false}]: dispatch
>>
>> 4) When iostat isn't working, we see no obvious errors in the mgr log.
>>
>> 5) When the dashboard is not working, mgr log sometimes shows:
>>
>> 2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
>> [:::10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
>> /api/health/minimal
>> 2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status": "500
>> Internal Server Error", "version": "3.2.2", "detail": "The server
>> encountered an unexpected condition which prevented it from fulfilling
>> the request.", "traceback": "Traceback (most recent call last):\\n  File
>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
>> in respond\\nresponse.body = self.handler()\\n  File
>> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
>> 188, in __call__\\nself.body = self.oldhandler(*args, **kwargs)\\n
>> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
>> 221, in wrap\\nreturn self.newhandler(innerfunc, *args, **kwargs)\\n
>> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
>> 88, in dashboard_exception_handler\\nreturn handler(*args,
>> **kwargs)\\n  File
>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
>> in __call__\\nreturn self.callable(*self.args, **self.kwargs)\\n
>> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
>> 649, in inner\\nret = func(*args, **kwargs)\\n  File
>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
>> minimal\\nreturn self.health_minimal.all_health()\\n  File
>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
>> all_health\\nresult[\'pools\'] = self.pools()\\n  File
>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
>> pools\\npools = CephService.get_pool_list_with_stats()\\n  File
>> \\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
>> in get_pool_list_with_stats\\n\'series\': [i for i in
>> stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']
>>
>>
>> 6) IPV6 is normally disabled on our machines at the kernel level, via
>> grubby --update-kernel=ALL --args="ipv6.disable=1"
>>
>> This was done as 'disabling ipv6' interfered with the dashboard (giving
>> "HEALTH_ERR Module 'dashboard' has failed: error('No socket could be
>> created',) we re-enabling ipv6 on the mgr nodes only to fix this.
>>
>>
>> Ideas...?
>>
>> Should ipv6 be enabled, even if not configured, on all ceph nodes?
>>
>> Any ideas on fixing this 

[ceph-users] health: HEALTH_ERR Module 'devicehealth' has failed: Failed to import _strptime because the import lockis held by another thread.

2019-08-27 Thread Peter Eisch
Hi,

What is the correct/best way to address this? It seems like a Python issue; 
maybe it's time I learn how to "restart" modules? The cluster seems to be 
working beyond this.

    health: HEALTH_ERR
            Module 'devicehealth' has failed: Failed to import _strptime 
because the import lock is held by another thread.



CEPH: Nautilus 14.2.2
3 - mons
3 - mgrs.
3 - mds
Full status:

  cluster:
id: 2fdb5976-1a38-4b29-1234-1ca74a9466ec
health: HEALTH_ERR
Module 'devicehealth' has failed: Failed to import _strptime 
because the import lock is held by another thread.

  services:
mon: 3 daemons, quorum cephmon01,cephmon02,cephmon03 (age 33m)
mgr: cephmon01(active, since 2h), standbys: cephmon02, cephmon03
mds: cephfs1:1 {0=cephmds-a03=up:active} 2 up:standby
osd: 103 osds: 103 up, 103 in
rgw: 3 daemons active (cephrgw-a01, cephrgw-a02, cephrgw-a03)

  data:
pools:   18 pools, 4901 pgs
objects: 4.28M objects, 16 TiB
usage:   49 TiB used, 97 TiB / 146 TiB avail
pgs: 4901 active+clean

  io:
client:   7.4 KiB/s rd, 24 MiB/s wr, 7 op/s rd, 628 op/s wr
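
For reference, a sketch of what I'm considering trying (cephmon01 is the active
mgr in the status above; devicehealth may be an always-on module, so simply
disabling/re-enabling it might be refused):

# fail over to a standby mgr, which reloads all mgr modules
ceph mgr fail cephmon01

# or restart the active mgr daemon on its host
systemctl restart ceph-mgr@cephmon01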



Peter Eisch
Senior Site Reliability Engineer
T1.612.659.3228
virginpulse.com
|virginpulse.com/global-challenge
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
Curious what distro you're running, as I've been having similar issues with 
instability in the mgr as well; wondering if there are any similar threads to pull at.

While the iostat command is running, is the active mgr using 100% CPU in top?
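
A quick way to check this on the node hosting the active mgr (a sketch; the
process name may differ slightly depending on packaging):

# one-shot snapshot of CPU/memory for all ceph-mgr processes:
ps -C ceph-mgr -o pid,%cpu,%mem,etime,cmd

# or a single batch run of top limited to the mgr PIDs:
top -b -n 1 -p "$(pgrep -d, -f ceph-mgr)"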

Reed

> On Aug 27, 2019, at 6:41 AM, Jake Grimmett  wrote:
> 
> Dear All,
> 
> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes.
> 
> Unfortunately "ceph iostat" spends most of it's time frozen, with
> occasional periods of working normally for less than a minute, then
> freeze again for a couple of minutes, then come back to life, and so so
> on...
> 
> No errors are seen on screen, unless I press CTRL+C when iostat is stalled:
> 
> [root@ceph-s3 ~]# ceph iostat
> ^CInterrupted
> Traceback (most recent call last):
>  File "/usr/bin/ceph", line 1263, in 
>retval = main()
>  File "/usr/bin/ceph", line 1194, in main
>verbose)
>  File "/usr/bin/ceph", line 619, in new_style_command
>ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
> sigdict, inbuf, verbose)
>  File "/usr/bin/ceph", line 593, in do_command
>return ret, '', ''
> UnboundLocalError: local variable 'ret' referenced before assignment
> 
> Observations:
> 
> 1) This problem does not seem to be related to load on the cluster.
> 
> 2) When iostat is stalled, the dashboard is also non-responsive; when
> iostat is working, the dashboard also works.
> 
> Presumably the iostat and dashboard problems are due to the same
> underlying fault? Perhaps a problem with the mgr?
> 
> 
> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
> shows:
> 
> 2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
> "prefix": "iostat", "poll": true, "target": ["mgr", ""], "print_header":
> false}]: dispatch
> 
> 4) When iostat isn't working, we see no obvious errors in the mgr log.
> 
> 5) When the dashboard is not working, mgr log sometimes shows:
> 
> 2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
> [:::10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
> /api/health/minimal
> 2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status": "500
> Internal Server Error", "version": "3.2.2", "detail": "The server
> encountered an unexpected condition which prevented it from fulfilling
> the request.", "traceback": "Traceback (most recent call last):\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
> in respond\\nresponse.body = self.handler()\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
> 188, in __call__\\nself.body = self.oldhandler(*args, **kwargs)\\n
> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
> 221, in wrap\\nreturn self.newhandler(innerfunc, *args, **kwargs)\\n
> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
> 88, in dashboard_exception_handler\\nreturn handler(*args,
> **kwargs)\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
> in __call__\\nreturn self.callable(*self.args, **self.kwargs)\\n
> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
> 649, in inner\\nret = func(*args, **kwargs)\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
> minimal\\nreturn self.health_minimal.all_health()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
> all_health\\nresult[\'pools\'] = self.pools()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
> pools\\npools = CephService.get_pool_list_with_stats()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
> in get_pool_list_with_stats\\n\'series\': [i for i in
> stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']
> 
> 
> 6) IPV6 is normally disabled on our machines at the kernel level, via
> grubby --update-kernel=ALL --args="ipv6.disable=1"
> 
> This was done because disabling ipv6 interfered with the dashboard (giving
> "HEALTH_ERR Module 'dashboard' has failed: error('No socket could be
> created',)"); we re-enabled ipv6 on the mgr nodes only to fix this.
> 
> 
> Ideas...?
> 
> Should ipv6 be enabled, even if not configured, on all ceph nodes?
> 
> Any ideas on fixing this gratefully received!
> 
> many thanks
> 
> Jake
> 
> -- 
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Igor Fedotov

Hi Stefan,

this looks like a duplicate of

https://tracker.ceph.com/issues/37282


Actually, the range of possible root causes is quite wide: anything from HW
issues to broken logic in RocksDB/BlueStore/BlueFS, etc.

As far as I understand, you have different OSDs failing, right? 
Is the set of broken OSDs limited somehow?


Is there a specific subset that keeps failing, or something like that? E.g. just 
N of them failing from time to time.


Any similarities among the broken OSDs (e.g. specific hardware)?


Did you run fsck on any of the broken OSDs? Any reports?

Any other errors/crashes in the logs before this sort of issue happens?


Just in case - what allocator are you using?
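
In case it's useful, something along these lines should cover both the allocator
and fsck checks on an OSD host (a sketch, assuming osd.12 as an example id and
the default data path):

# current allocator of a running OSD, via its admin socket:
ceph daemon osd.12 config get bluestore_allocator

# fsck needs the OSD stopped first:
systemctl stop ceph-osd@12
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12
systemctl start ceph-osd@12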


Thanks,

Igor



On 8/27/2019 1:03 PM, Stefan Priebe - Profihost AG wrote:

Hello,

for some months now, all our bluestore OSDs have been crashing from time to time.
Currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Put( Prefix = M key =
0x09a5'.916366.74680351' Value size = 184)
Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'
Value size = 530)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'
Value size = 510)
Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

ceph version 12.2.12-7-g1321c5e91f
(1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5653a010e222]
  2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
  3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
  4: (()+0x7494) [0x7fb1ab2f6494]
  5: (clone()+0x3f) [0x7fb1aa37dacf]

I already opened a tracker:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] iostat and dashboard freezing

2019-08-27 Thread Jake Grimmett
Dear All,

We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes.

Unfortunately "ceph iostat" spends most of it's time frozen, with
occasional periods of working normally for less than a minute, then
freeze again for a couple of minutes, then come back to life, and so so
on...

No errors are seen on screen, unless I press CTRL+C when iostat is stalled:

[root@ceph-s3 ~]# ceph iostat
^CInterrupted
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1263, in 
retval = main()
  File "/usr/bin/ceph", line 1194, in main
verbose)
  File "/usr/bin/ceph", line 619, in new_style_command
ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 593, in do_command
return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment

Observations:

1) This problem does not seem to be related to load on the cluster.

2) When iostat is stalled, the dashboard is also non-responsive; when
iostat is working, the dashboard also works.

Presumably the iostat and dashboard problems are due to the same
underlying fault? Perhaps a problem with the mgr?


3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
shows:

2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
"prefix": "iostat", "poll": true, "target": ["mgr", ""], "print_header":
false}]: dispatch

4) When iostat isn't working, we see no obvious errors in the mgr log.

5) When the dashboard is not working, mgr log sometimes shows:

2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
[:::10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
/api/health/minimal
2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status": "500
Internal Server Error", "version": "3.2.2", "detail": "The server
encountered an unexpected condition which prevented it from fulfilling
the request.", "traceback": "Traceback (most recent call last):\\n  File
\\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
in respond\\nresponse.body = self.handler()\\n  File
\\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
188, in __call__\\nself.body = self.oldhandler(*args, **kwargs)\\n
File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
221, in wrap\\nreturn self.newhandler(innerfunc, *args, **kwargs)\\n
 File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
88, in dashboard_exception_handler\\nreturn handler(*args,
**kwargs)\\n  File
\\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
in __call__\\nreturn self.callable(*self.args, **self.kwargs)\\n
File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
649, in inner\\nret = func(*args, **kwargs)\\n  File
\\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
minimal\\nreturn self.health_minimal.all_health()\\n  File
\\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
all_health\\nresult[\'pools\'] = self.pools()\\n  File
\\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
pools\\npools = CephService.get_pool_list_with_stats()\\n  File
\\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
in get_pool_list_with_stats\\n\'series\': [i for i in
stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']


6) IPV6 is normally disabled on our machines at the kernel level, via
grubby --update-kernel=ALL --args="ipv6.disable=1"

This was done because disabling ipv6 interfered with the dashboard (giving
"HEALTH_ERR Module 'dashboard' has failed: error('No socket could be
created',)"); we re-enabled ipv6 on the mgr nodes only to fix this.


Ideas...?

Should ipv6 be enabled, even if not configured, on all ceph nodes?
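
If re-enabling it everywhere turns out to be the answer, reverting the kernel
argument would be something along these lines (a sketch; every node needs a
reboot afterwards):

# remove the ipv6.disable=1 kernel argument added earlier:
grubby --update-kernel=ALL --remove-args="ipv6.disable=1"
reboot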

Any ideas on fixing this gratefully received!
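
As a side note on the "deque mutated during iteration" error in observation 5:
this is ordinary CPython behaviour when a deque is appended to while it is being
iterated, which points at the dashboard reading a stats deque that another mgr
thread is updating. A minimal reproduction of just the Python behaviour, not of
the mgr code itself (python2 here only to match the /usr/lib/python2.7 paths in
the traceback; python3 behaves the same):

python2 -c '
from collections import deque
d = deque(range(5))
for i in d:
    d.append(i)   # mutate the deque while iterating over it
'
# -> RuntimeError: deque mutated during iteration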

many thanks

Jake

-- 
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore OSDs keep crashing in BlueStore.cc: 8808: FAILED assert(r == 0)

2019-08-27 Thread Stefan Priebe - Profihost AG
Hello,

for some months now, all our bluestore OSDs have been crashing from time to time.
Currently about 5 OSDs per day.

All of them show the following trace:
Trace:
2019-07-24 08:36:48.995397 7fb19a711700 -1 rocksdb: submit_transaction
error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
Put( Prefix = M key =
0x09a5'.916366.74680351' Value size = 184)
Put( Prefix = M key = 0x09a5'._fastinfo' Value size = 186)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe6f0024'x'
Value size = 530)
Put( Prefix = O key =
0x7f8003bb605f'd!rbd_data.afe49a6b8b4567.3c11!='0xfffe'o'
Value size = 510)
Put( Prefix = L key = 0x10ba60f1 Value size = 4135)
2019-07-24 08:36:49.012110 7fb19a711700 -1
/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
BlueStore::_kv_sync_thread()' thread 7fb19a711700 time 2019-07-24
08:36:48.995415
/build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)

ceph version 12.2.12-7-g1321c5e91f
(1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5653a010e222]
 2: (BlueStore::_kv_sync_thread()+0x24c5) [0x56539ff964b5]
 3: (BlueStore::KVSyncThread::entry()+0xd) [0x56539ffd708d]
 4: (()+0x7494) [0x7fb1ab2f6494]
 5: (clone()+0x3f) [0x7fb1aa37dacf]

I already opened a tracker:
https://tracker.ceph.com/issues/41367

Can anybody help? Is this known?
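
In case it helps to narrow this down, checks along these lines on the affected
hosts might rule out the disk or controller behind a "block checksum mismatch"
(a sketch; /dev/sdX stands for the device backing a crashed OSD):

# SMART health and error counters for the backing device:
smartctl -a /dev/sdX

# kernel-level I/O errors around the time of the crash:
dmesg -T | grep -iE 'error|fail|reset'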

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

