[ceph-users] dashboard

2018-10-05 Thread solarflow99
I enabled the dashboard module in ansible but I don't see ceph-mgr
listening on a port for it.  Is there something else I missed?
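(For reference, a quick way to narrow this down is to confirm the module is actually enabled and which URL the active mgr advertises; both commands exist in Luminous and later:

# ceph mgr module ls
# ceph mgr services

Note that the dashboard only listens on the host running the active mgr, and the default port differs by release (7000 for the Luminous dashboard, 8443 with SSL for the Mimic one); treat those defaults as assumptions and check the dashboard documentation for your release.)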
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged after mimic 13.2.1 to 13.2.2 upgrade

2018-10-05 Thread Sergey Malinin
Update:
I discovered http://tracker.ceph.com/issues/24236 and https://github.com/ceph/ceph/pull/22146.

Make sure that it is not relevant in your case before proceeding to operations 
that modify on-disk data.
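For reference, the data-scan phases mentioned in the quoted message below are, in their basic form, roughly:

# cephfs-data-scan scan_extents <data pool>
# cephfs-data-scan scan_inodes <data pool>

The alternate-metadata-pool variant adds further options on top of these two steps; the linked disaster-recovery document has the exact invocations, so treat the lines above only as a sketch.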


> On 6.10.2018, at 03:17, Sergey Malinin  wrote:
> 
> I ended up rescanning the entire fs using the alternate metadata pool approach as 
> in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/ 
> 
> The process has not completed yet because during the recovery our cluster 
> encountered another problem with OSDs that I got fixed yesterday (thanks to 
> Igor Fedotov @ SUSE).
> The first stage (scan_extents) completed in 84 hours (120M objects in data 
> pool on 8 hdd OSDs on 4 hosts). The second (scan_inodes) was interrupted by 
> OSD failure, so I have no timing stats, but it seems to be running 2-3 times 
> faster than extents scan.
> As to root cause -- in my case I recall that during upgrade I had forgotten 
> to restart 3 OSDs, one of which was holding metadata pool contents, before 
> restarting MDS daemons and that seemed to have had an impact on MDS journal 
> corruption, because when I restarted those OSDs, MDS was able to start up but 
> soon failed throwing lots of 'loaded dup inode' errors.
> 
> 
>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky wrote:
>> 
>> Same problem...
>> 
>> # cephfs-journal-tool --journal=purge_queue journal inspect
>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.016c
>> Overall journal integrity: DAMAGED
>> Objects missing:
>>   0x16c
>> Corrupt regions:
>>   0x5b00-
>> 
>> Just after upgrade to 13.2.2
>> 
>> Did you fix it?
>> 
>> 
>> On 26/09/18 13:05, Sergey Malinin wrote:
>>> Hello,
>>> Followed the standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
>>> After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are 
>>> damaged. Resetting the purge_queue does not seem to work well, as the journal still 
>>> appears to be damaged.
>>> Can anybody help?
>>> 
>>> mds log:
>>> 
>>>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map 
>>> to version 586 from mon.2
>>>   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i 
>>> am now mds.0.583
>>>   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map 
>>> state change up:rejoin --> up:active
>>>   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
>>> successful recovery!
>>> 
>>>-38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: 
>>> Decode error at read_pos=0x322ec6636
>>>-37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 
>>> set_want_state: up:active -> down:damaged
>>>-36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send 
>>> down:damaged seq 137
>>>-35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: 
>>> _send_mon_message to mon.ceph3 at mon:6789/0
>>>-34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> 
>>> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 
>>> 0x563b321ad480 con 0
>>> 
>>> -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> 
>>> mon:6789/0 conn(0x563b3213e000 :-1 
>>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 
>>> 29 0x563b321ab880 mdsbeaco
>>> n(85106/mds2 down:damaged seq 311 v587) v7
>>> -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== 
>>> mon.2 mon:6789/0 29  mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 
>>>  129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e
>>> 000
>>> -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 
>>> handle_mds_beacon down:damaged seq 311 rtt 0.038261
>>>  0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
>>> 
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> Overall journal integrity: DAMAGED
>>> Corrupt regions:
>>>   0x322ec65d9-
>>> 
>>> # cephfs-journal-tool --journal=purge_queue journal reset
>>> old journal was 13470819801~8463
>>> new journal start will be 13472104448 (1276184 bytes past old end)
>>> writing journal head
>>> done
>>> 
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.0c8c
>>> Overall journal integrity: DAMAGED
>>> Objects missing:
>>>   0xc8c
>>> Corrupt regions:
>>>   0x32300-
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-05 Thread Stefan Kooman
Quoting Gregory Farnum (gfar...@redhat.com):
> 
> Ah, there's a misunderstanding here — the output isn't terribly clear.
> "is_healthy" is the name of a *function* in the source code. The line
> 
> heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 
> is telling you that the heartbeat_map's is_healthy function is running, and
> it finds that "'MDSRank' had timed out after 15 [seconds]". So the thread
> MDSRank is *not* healthy, it didn't check in for 15 seconds! Therefore the
> MDS beacon code decides not to send a beacon, because it thinks the MDS
> might be stuck.

Thanks for the explanation.

> From what you've described here, it's most likely that the MDS is trying to
> read something out of RADOS which is taking a long time, and which we
> didn't expect to cause a slow down. You can check via the admin socket to
> see if there are outstanding Objecter requests or ops_in_flight to get a
> clue.

Hmm, I avoided that because of this issue [1]. Killing the MDS while
debugging why it's hanging is defeating the purpose ;-).

I might check for "Objecter requests".
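For reference, the admin socket queries Greg refers to would look roughly like this (the daemon name is a placeholder):

# ceph daemon mds.<name> objecter_requests
# ceph daemon mds.<name> ops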

Thanks!

Gr. Stefan

[1]: http://tracker.ceph.com/issues/26894

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-05 Thread Gregory Farnum
On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman  wrote:

> Dear list,
>
> Today we hit our first Ceph MDS issue. Out of the blue the active MDS
> stopped working:
>
> mon.mon1 [WRN] daemon mds.mds1 is not responding, replacing it as rank 0
> with standby
> daemon mds.mds2.
>
> Logging of ceph-mds1:
>
> 2018-10-04 10:50:08.524745 7fdd516bf700 1 mds.mds1 asok_command: status
> (starting...)
> 2018-10-04 10:50:08.524782 7fdd516bf700 1 mds.mds1 asok_command: status
> (complete)
>
> ^^ one of our monitoring health checks performing a "ceph daemon mds.mds1
> version", business as usual.
>
> 2018-10-04 10:52:36.712525 7fdd51ec0700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:52:36.747577 7fdd4deb8700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:52:36.747584 7fdd4deb8700 1 mds.beacon.mds1 _send skipping
> beacon, heartbeat map not healthy
>
> ^^ the unresponsive mds1 consumes 100% CPU and keeps on logging the above
> heartbeat_map messages.
>
> In the meantime ceph-mds2 has transitioned from "standby-replay" to
> "active":
>
> mon.mon1 [INF] daemon mds.mds2 is now active in filesystem
> BITED-153874-cephfs as rank 0
>
> Logging:
>
> replays, final replay as standby, reopen log
>
> 2018-10-04 10:52:53.268470 7fdb231d9700 1 mds.0.141 reconnect_done
> 2018-10-04 10:52:53.759844 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 143 from mon.3
> 2018-10-04 10:52:53.759859 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:53.759862 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:reconnect --> up:rejoin
> 2018-10-04 10:52:53.759868 7fdb231d9700 1 mds.0.141 rejoin_start
> 2018-10-04 10:52:53.759970 7fdb231d9700 1 mds.0.141 rejoin_joint_start
> 2018-10-04 10:52:53.760970 7fdb1d1cd700 0 mds.0.cache failed to open ino
> 0x1cd95e9 err -5/0
> 2018-10-04 10:52:54.126658 7fdb1d1cd700 1 mds.0.141 rejoin_done
> 2018-10-04 10:52:54.770457 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 144 from mon.3
> 2018-10-04 10:52:54.770484 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:54.770487 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:rejoin --> up:clientreplay
> 2018-10-04 10:52:54.770494 7fdb231d9700 1 mds.0.141 recovery_done --
> successful recovery!
> 2018-10-04 10:52:54.770617 7fdb231d9700 1 mds.0.141 clientreplay_start
> 2018-10-04 10:52:54.882995 7fdb1d1cd700 1 mds.0.141 clientreplay_done
> 2018-10-04 10:52:55.778598 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 145 from mon.3
> 2018-10-04 10:52:55.778622 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:55.778628 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:clientreplay --> up:active
> 2018-10-04 10:52:55.778638 7fdb231d9700 1 mds.0.141 active_start
> 2018-10-04 10:52:55.805206 7fdb231d9700 1 mds.0.141 cluster recovered.
>
> And then it _also_ starts to log heartbeat_map messages (and consuming 100%
> CPU):
>
> and then these messages, which keep repeating themselves at 100% cpu
> 2018-10-04 10:53:41.550793 7fdb241db700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:53:42.884018 7fdb201d3700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:53:42.884024 7fdb201d3700 1 mds.beacon.mds2 _send skipping
> beacon, heartbeat map not healthy
>
> At that point in time there is one active MDS according to ceph, but in
> reality it's
> not functioning correctly (not serving clients at least).
>
> ... we stopped both daemons. Restarted one ... recovery ...
> works for half a minute ... then starts logging heartbeat_map messages.
> Restart again ... works for a little while ... starts logging
> heartbeat_map messages again. We restart the mds with debug_mds=20 ...
> it keeps working fine. The other mds gets restarted and keeps on
> working. We do a couple of failover tests ... works flawlessly
> (failover in < 1 second, clients reconnect instantly).
>
> A couple of hours later we hit the same issue. We restarted with
> debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
> hours later (an hour ago) we hit the same issue. We captured ~ 4.7 GB of
> logging. I skipped to the end of the log file just before the
> "heartbeat_map" messages start:
>
> 2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146
> pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
> 2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
> 2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5021
> 2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5022
> 2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5023
> 2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5024
> 2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 23:24:11.088871 

Re: [ceph-users] Cannot write to cephfs if some osd's are not available on the client network

2018-10-05 Thread Gregory Farnum
On Fri, Oct 5, 2018 at 3:13 AM Marc Roos  wrote:

>
>
> I guess then this waiting "quietly" should be looked at again, I am
> having a load of 10 on this vm.
>
> [@~]# uptime
>  11:51:58 up 4 days,  1:35,  1 user,  load average: 10.00, 10.01, 10.05
>
> [@~]# uname -a
> Linux smb 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018
> x86_64 x86_64 x86_64 GNU/Linux
>
> [@~]# cat /etc/redhat-release
> CentOS Linux release 7.5.1804 (Core)
>
> [@~]# dmesg
> [348948.927734] libceph: osd23 192.168.10.114:6810 socket closed (con
> state CONNECTING)
> [348957.120090] libceph: osd27 192.168.10.114:6802 socket closed (con
> state CONNECTING)
> [349010.370171] libceph: osd26 192.168.10.114:6806 socket closed (con
> state CONNECTING)
> [349114.822301] libceph: osd24 192.168.10.114:6804 socket closed (con
> state CONNECTING)
> [349141.447330] libceph: osd29 192.168.10.114:6812 socket closed (con
> state CONNECTING)
> [349278.668658] libceph: osd25 192.168.10.114:6800 socket closed (con
> state CONNECTING)
> [349440.467038] libceph: osd28 192.168.10.114:6808 socket closed (con
> state CONNECTING)
> [349465.043957] libceph: osd23 192.168.10.114:6810 socket closed (con
> state CONNECTING)
> [349473.236400] libceph: osd27 192.168.10.114:6802 socket closed (con
> state CONNECTING)
> [349526.486408] libceph: osd26 192.168.10.114:6806 socket closed (con
> state CONNECTING)
> [349630.938498] libceph: osd24 192.168.10.114:6804 socket closed (con
> state CONNECTING)
> [349657.563561] libceph: osd29 192.168.10.114:6812 socket closed (con
> state CONNECTING)
> [349794.784936] libceph: osd25 192.168.10.114:6800 socket closed (con
> state CONNECTING)
> [349956.583300] libceph: osd28 192.168.10.114:6808 socket closed (con
> state CONNECTING)
> [349981.160225] libceph: osd23 192.168.10.114:6810 socket closed (con
> state CONNECTING)
> [349989.352510] libceph: osd27 192.168.10.114:6802 socket closed (con
> state CONNECTING)
>

Looks like in this case the client is spinning trying to establish the
network connections it expects to be available. There's not really much
else it can do — we expect and require full routing. The monitors are
telling the clients that the OSDs are up and available, and it is doing
data IO that requires them. So it tries to establish a connection, sees the
network fail, and tries again.

Unfortunately the restricted-network use case you're playing with here is
just not supported by Ceph.
-Greg


> ..
> ..
> ..
>
>
>
>
> -Original Message-
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: donderdag 27 september 2018 11:43
> To: Marc Roos
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Cannot write to cephfs if some osd's are not
> available on the client network
>
> On Thu, Sep 27, 2018 at 10:16 AM Marc Roos 
> wrote:
> >
> >
> > I have a test cluster and on a osd node I put a vm. The vm is using a
> > macvtap on the client network interface of the osd node. Making access
>
> > to local osd's impossible.
> >
> > the vm of course reports that it cannot access the local osd's. What I
>
> > am getting is:
> >
> > - I cannot reboot this vm normally, need to reset it.
>
> When linux tries to shut down cleanly, part of that is flushing buffers
> from any mounted filesystem back to disk.  If you have a network
> filesystem mounted, and the network is unavailable, that can cause the
> process to block.  You can try forcibly unmounting before rebooting.
>
> > - vm is reporting very high load.
>
> The CPU load part is surprising -- in general Ceph clients should wait
> quietly when blocked, rather than spinning.
>
> I guess this should not be happening, no? Because it should choose
> another available osd of the 3x replicated pool and just write the data
> > to that one?
>
> No -- writes always go through the primary OSD for the PG being written
> to.  If an OSD goes down, then another OSD will become the primary.  In
> your case, the primary OSD is not going down, it's just being cut off
> from the client by the network, so the writes are blocking indefinitely.
>
> John
>
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] provide cephfs to mutiple project

2018-10-05 Thread Gregory Farnum
Check out http://docs.ceph.com/docs/master/cephfs/client-auth/
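A minimal sketch of the per-project setup described there, assuming a filesystem named "cephfs" with top-level directories /ProjectA and /ProjectB (names, sizes and mount options are placeholders):

ceph fs authorize cephfs client.projecta /ProjectA rw
setfattr -n ceph.quota.max_bytes -v 54975581388800 /mnt/cephfs/ProjectA
mount -t ceph cephmon1,cephmon2:/ProjectA /mnt/ProjectA -o name=projecta,secret=...

The first command restricts the client key to its own subtree (so mounting / or /ProjectB fails), the second sets a 50TB quota on the directory, and ceph-fuse as well as recent kernel clients report that quota as the filesystem size in df when the directory is mounted.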

On Wed, Oct 3, 2018 at 8:58 PM Joshua Chen 
wrote:

> Hello all,
>   I am almost ready to provide storage (cephfs in the beginning) to my
> colleagues, they belong to different main project, and according to their
> budget that are previously claimed, to have different capacity. For example
> ProjectA will have 50TB, ProjectB will have 150TB.
>
> I chose cephfs because it has good enough throughput compared to
> rbd.
>
> but I would like to let clients in ProjectA only see 50TB mount space (by
> linux df -h maybe) and ProjectB clients see 150TB. So my questions are:
> 1. Is that possible? Can cephfs make clients see different available
> space respectively?
>
> 2. What is a good setup so that ProjectA has a reasonable mount source and
> ProjectB has its own?
>
> for example
> in projecta client root, he will do
> mount -t ceph cephmon1,cephmon2:/ProjectA /mnt/ProjectA
>
> but can not
>
> mount -t ceph cephmon1,cephmon2:/ProjectB /mnt/ProjectB
>
> (cannot mount the root /, nor /ProjectB, which is not their area)
>
> Or what is the official production approach for this need?
>
> Thanks in advance
> Cheers
> Joshua
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] interpreting ceph mds stat

2018-10-05 Thread Gregory Farnum
On Wed, Oct 3, 2018 at 10:09 AM Jeff Smith  wrote:

> I need some help deciphering the results of ceph mds stat.  I have
> been digging in the docs for hours.  If someone can point me in the
> right direction and/or help me understand.
>
> In the documentation it shows a result like this.
>
> cephfs-1/1/1 up {0=a=up:active}
>
> What do each of the 1s represent?




>What is the 0=a=up:active?  Is
> that saying rank 0 of file system a is up:active?
>

Rank 0 is assigned to mds.a and it is up in the active state. I forget if
the fs name is elided in a single-fs output or if it's the "cephfs-" bit at
the beginning.
-Greg
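(For a more readable breakdown of which daemon holds which rank, in which filesystem and in which state, "ceph fs status" and "ceph fs dump" show the same information in expanded form.)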
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some questions concerning filestore --> bluestore migration

2018-10-05 Thread Mark Nelson
FWIW, here are values I measured directly from the RocksDB SST files 
under different small write workloads (ie the ones where you'd expect a 
larger DB footprint):


https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing

These tests were only with 256GB of data written to a single OSD, so 
there's no guarantee that it will scale linearly up to 10TB 
(specifically it's possible that much larger RocksDB databases could 
have higher space amplification).  Also note that the RGW numbers could 
be very dependent on the client workload and are not likely universally 
representative.


Also remember that if you run out of space on your DB partitions you'll 
just end up putting higher rocksdb levels on the block device.  Slower 
to be sure, but not necessarily worse than filestore's behavior 
(especially in the RGW case, where the large object counts will cause PG 
directory splitting chaos).
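(For reference: whether a given OSD has already spilled over can be checked from its bluefs perf counters, e.g.

# ceph daemon osd.0 perf dump bluefs

and comparing db_used_bytes against slow_used_bytes; the OSD id here is illustrative and the counter names may differ slightly between releases.)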


Mark

On 10/05/2018 01:38 PM, solarflow99 wrote:
Oh my... yes, 2TB enterprise-class SSDs, that's a much higher requirement 
than filestore needed.  That would be cost prohibitive to any lower-end 
ceph cluster.




On Thu, Oct 4, 2018 at 11:19 PM Massimo Sgaravatto
<massimo.sgarava...@gmail.com> wrote:


Argg !!
With 10x10TB SATA DB and 2 SSD disks this would mean 2 TB for each
SSD !
If this is really required I am afraid I will keep using filestore ...

Cheers, Massimo

On Fri, Oct 5, 2018 at 7:26 AM mailto:c...@elchaka.de>> wrote:

Hello

Am 4. Oktober 2018 02:38:35 MESZ schrieb solarflow99
mailto:solarflo...@gmail.com>>:
>I use the same configuration you have, and I plan on using
bluestore.
>My
>SSDs are only 240GB and it worked with filestore all this time, I
>suspect
>bluestore should be fine too.
>
>
>On Wed, Oct 3, 2018 at 4:25 AM Massimo Sgaravatto <
>massimo.sgarava...@gmail.com
> wrote:
>
>> Hi
>>
>> I have a ceph cluster, running luminous, composed of 5 OSD
nodes,
>which is
>> using filestore.
>> Each OSD node has 2 E5-2620 v4 processors, 64 GB of RAM,
10x6TB SATA
>disk
>> + 2x200GB SSD disk (then I have 2 other disks in RAID for
the OS), 10
>Gbps.
>> So each SSD disk is used for the journal for 5 OSDs. With this
>> configuration everything is running smoothly ...
>>
>>
>> We are now buying some new storage nodes, and I am trying
to buy
>something
>> which is bluestore compliant. So the idea is to consider a
>configuration
>> something like:
>>
>> - 10 SATA disks (8TB / 10TB / 12TB each. TBD)
>> - 2 processor (~ 10 core each)
>> - 64 GB of RAM
>> - 2 SSD to be used for WAL+DB
>> - 10 Gbps
>>
>> For what concerns the size of the SSD disks I read in this
mailing
>list
>> that it is suggested to have at least 10GB of SSD disk/10TB
of SATA
>disk.
>>
>>
>> So, the questions:
>>
>> 1) Does this hardware configuration seem reasonable ?
>>
>> 2) Are there problems to live (forever, or until filestore
>deprecation)
>> with some OSDs using filestore (the old ones) and some OSDs
using
>bluestore
>> (the old ones) ?
>>
>> 3) Would you suggest to update to bluestore also the old
OSDs, even
>if the
>> available SSDs are too small (they don't satisfy the "10GB
of SSD
>disk/10TB
>> of SATA disk" rule) ?

AFAIR should the db size 4% of the osd in question.

So

For example, if the block size is 1TB, then block.db shouldn’t
be less than 40GB

See:

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

Hth
- Mehmet

>>
>> Thanks, Massimo
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Some questions concerning filestore --> bluestore migration

2018-10-05 Thread solarflow99
Oh my... yes, 2TB enterprise-class SSDs, that's a much higher requirement than
filestore needed.  That would be cost prohibitive to any lower-end ceph
cluster.
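(The arithmetic behind that: 4% of a 10TB OSD is roughly 400GB of DB space, and with 5 OSDs sharing one SSD that works out to about 2TB per SSD.)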



On Thu, Oct 4, 2018 at 11:19 PM Massimo Sgaravatto <
massimo.sgarava...@gmail.com> wrote:

> Argg !!
> With 10x10TB SATA disks and 2 SSD disks this would mean 2 TB for each SSD !
> If this is really required I am afraid I will keep using filestore ...
>
> Cheers, Massimo
>
> On Fri, Oct 5, 2018 at 7:26 AM  wrote:
>
>> Hello
>>
>> On 4 October 2018 02:38:35 MESZ, solarflow99 <
>> solarflo...@gmail.com> wrote:
>> >I use the same configuration you have, and I plan on using bluestore.
>> >My
>> >SSDs are only 240GB and it worked with filestore all this time, I
>> >suspect
>> >bluestore should be fine too.
>> >
>> >
>> >On Wed, Oct 3, 2018 at 4:25 AM Massimo Sgaravatto <
>> >massimo.sgarava...@gmail.com> wrote:
>> >
>> >> Hi
>> >>
>> >> I have a ceph cluster, running luminous, composed of 5 OSD nodes,
>> >which is
>> >> using filestore.
>> >> Each OSD node has 2 E5-2620 v4 processors, 64 GB of RAM, 10x6TB SATA
>> >disk
>> >> + 2x200GB SSD disk (then I have 2 other disks in RAID for the OS), 10
>> >Gbps.
>> >> So each SSD disk is used for the journal for 5 OSDs. With this
>> >> configuration everything is running smoothly ...
>> >>
>> >>
>> >> We are now buying some new storage nodes, and I am trying to buy
>> >something
>> >> which is bluestore compliant. So the idea is to consider a
>> >configuration
>> >> something like:
>> >>
>> >> - 10 SATA disks (8TB / 10TB / 12TB each. TBD)
>> >> - 2 processor (~ 10 core each)
>> >> - 64 GB of RAM
>> >> - 2 SSD to be used for WAL+DB
>> >> - 10 Gbps
>> >>
>> >> For what concerns the size of the SSD disks I read in this mailing
>> >list
>> >> that it is suggested to have at least 10GB of SSD disk/10TB of SATA
>> >disk.
>> >>
>> >>
>> >> So, the questions:
>> >>
>> >> 1) Does this hardware configuration seem reasonable ?
>> >>
>> >> 2) Are there problems to live (forever, or until filestore
>> >deprecation)
>> >> with some OSDs using filestore (the old ones) and some OSDs using
>> >bluestore
>> >> (the old ones) ?
>> >>
>> >> 3) Would you suggest to update to bluestore also the old OSDs, even
>> >if the
>> >> available SSDs are too small (they don't satisfy the "10GB of SSD
>> >disk/10TB
>> >> of SATA disk" rule) ?
>>
>> AFAIR the DB size should be 4% of the OSD in question.
>>
>> So
>>
>> For example, if the block size is 1TB, then block.db shouldn’t be less
>> than 40GB
>>
>> See:
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
>>
>> Hth
>> - Mehmet
>>
>> >>
>> >> Thanks, Massimo
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] deep scrub error caused by missing object

2018-10-05 Thread ceph
Hello Roman,

I am not sure if I can be of help, but perhaps these commands can help to find 
the objects in question...

ceph health detail
rados list-inconsistent-pg rbd
rados list-inconsistent-obj 2.10d

It would also be interesting to know whether you use bluestore or filestore...

Hth
- Mehmet 

On 4 October 2018 14:06:07 MESZ, Roman Steinhart wrote:
>Hi all,
>
>for some weeks now we have had a small problem with one of the PGs on our
>ceph cluster.
>Every time the pg 2.10d is deep scrubbing it fails because of this:
>2018-08-06 19:36:28.080707 osd.14 osd.14 *.*.*.110:6809/3935 133 :
>cluster [ERR] 2.10d scrub stat mismatch, got 397/398 objects, 0/0
>clones, 397/398 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0
>whiteouts, 2609281919/2609293215 bytes, 0/0 hit_set_archive bytes.
>2018-08-06 19:36:28.080905 osd.14 osd.14 *.*.*.110:6809/3935 134 :
>cluster [ERR] 2.10d scrub 1 errors
>As far as I understand ceph is missing an object on that osd.14 which
>should be stored on this osd. A small ceph pg repair 2.10d fixes the
>problem but as soon as a deep scrubbing job for that pg is running
>again (manually or automatically) the problem is back again.
>I tried to find out which object is missing, but a small search leads
>me to the result that there is no real way to find out which objects
>are stored in this PG or which object exactly is missing.
>That's why I've gone for some "unconventional" methods.
>I completely removed OSD.14 from the cluster. I waited until everything
>was balanced and then added the OSD again.
>Unfortunately the problem is still there.
>
>Some weeks later we've added a huge amount of OSD's to our cluster
>which had a big impact on the crush map.
>Since then the PG 2.10d was running on two other OSD's -> [119,93] (We
>have a replica of 2)
>Still the same error message, but another OSD:
>2018-10-03 03:39:22.776521 7f12d9979700 -1 log_channel(cluster) log
>[ERR] : 2.10d scrub stat mismatch, got 728/729 objects, 0/0 clones,
>728/729 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0
>whiteouts, 7281369687/7281381269 bytes, 0/0 hit_set_archive bytes.
>
>As a first step it would be enough for me to find out what the
>problematic object is. Then I am able to check if the object is
>critical, if any recovery is required or if I am able to just drop that
>object (that would be 90% of the cases).
>I hope anyone is able to help me to get rid of this.
>It's not really a problem for us. Ceph runs despite this message
>without further problems.
>It's just a bit annoying that every time the error occurs our
>monitoring triggers a big alarm because Ceph is in ERROR status. :)
>
>Thanks in advance,
>Roman
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster broken and OSDs crash with failed assertion in PGLog::merge_log

2018-10-05 Thread Neha Ojha
Hi JJ,

In this case, the condition olog.head >= log.tail is not true,
and therefore it crashes. Could you please open a tracker
issue (https://tracker.ceph.com/) and attach the osd logs and the pg
dump output?

Thanks,
Neha
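(For reference, the requested output can be captured with something like "ceph pg dump > pg_dump.txt", plus "ceph pg <pgid> query" for the affected PGs.)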

On Thu, Oct 4, 2018 at 9:29 AM, Jonas Jelten  wrote:
> Hello!
>
> Unfortunately, our single-node "cluster" with 11 OSDs is broken because some 
> OSDs crash when they start peering.
> I'm on Ubuntu 18.04 with Ceph Mimic (13.2.2).
>
> The problem was induced when RAM filled up and OSD processes then 
> crashed because of memory allocation failures.
>
> No weird commands (e.g. force_create_pg) were used on this cluster and it was 
> set up with 13.2.1 initially.
> The affected pool seems to be a replicated pool with size=3 and min_size=2 
> (which haven't been changed).
>
> Crash log of osd.4 (only the crashed thread):
>
> 99424: -1577> 2018-10-04 13:40:11.024 7f3838417700 10 log is not dirty
> 99425: -1576> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 1433 
> queue_want_up_thru want 1433 <= queued 1433, currently 1426
> 99427: -1574> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
> 3.8 to_process <> waiting <>
> waiting_peering {}
> 99428: -1573> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
> OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
> 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 
> sent:1433 3.8( v 866'122691 (569'119300,866'122691]
> local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
> 1433/1433/1433)) features:
> 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 
> 0,2),([1308,1311] acting 4,10),([1401,1403] acting
> 2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433) queued
> 99430: -1571> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
> 3.8 to_process  PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 
> notify: (query:1433 sent:1433 3.8( v
> 866'122691 (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 
> lis/c 1401/859 les/c/f 1402/860/0
> 1433/1433/1433)) features: 0x3ffddff8ffa4fffb ([859,1432] 
> intervals=([1213,1215] acting 0,2),([1308,1311] acting
> 4,10),([1401,1403] acting 2,10),([1426,1428] acting 2,4)) +create_info) prio 
> 255 cost 10 e1433)> waiting <>
> waiting_peering {}
> 99433: -1568> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
> OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
> 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 
> sent:1433 3.8( v 866'122691 (569'119300,866'122691]
> local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
> 1433/1433/1433)) features:
> 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 
> 0,2),([1308,1311] acting 4,10),([1401,1403] acting
> 2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433) pg 
> 0x56013bc87400
> 99437: -1564> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 
> pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
> 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] 
> do_peering_event: epoch_sent: 1433 epoch_requested:
> 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 
> (569'119300,866'122691]
> local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
> 1433/1433/1433)) features:
> 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 
> 0,2),([1308,1311] acting 4,10),([1401,1403] acting
> 2,10),([1426,1428] acting 2,4)) +create_info
> 99440: -1561> 2018-10-04 13:40:11.024 7f3838417700  7 osd.4 pg_epoch: 1433 
> pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
> 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] 
> state: handle_pg_notify from osd.2
> 99444: -1557> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 
> pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
> 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}]  got dup 
> osd.2 info 3.8( v 866'122691
> (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 
> 1401/859 les/c/f 1402/860/0 1433/1433/1433),
> identical to ours
> 99445: -1556> 2018-10-04 13:40:11.024 7f3838417700 10 log is not dirty
> 99446: -1555> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 1433 
> queue_want_up_thru want 1433 <= queued 1433, currently 1426
> 99448: -1553> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
> 3.8 to_process <> waiting <>
> waiting_peering {}
> 99450: -1551> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
> OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
> 1433 epoch_requested: 1433 MLogRec from 2 +create_info)

Re: [ceph-users] Mimic offline problem

2018-10-05 Thread Sage Weil
Quick update here:

The problem with the OSDs that are throwing rocksdb errors (missing SST 
files) is that ceph-kvstore-tool bluestore-kv ... repair was run on OSDs, 
and it looks like the rocksdb repair function actually broke the 
(non-broken) rocksdb instance.  I'm not quite sure why that is the 
case--seems like a pretty big problem for a repair to be unsafe--so that 
is something we need to follow up on with the rocksdb folks.

The (possible) good news is that the structure of bluefs is such that 
it looks like we can limit replay of the internal journal and effectively 
roll back the changes made by repair.  That seems to have worked on one 
OSD at least; need to see if it works on others too.

However, even if all those are fixed, various other low-level 
ceph-objectstore-tool commands were run on the OSDs and I have a feeling 
those are going to prevent recovery.  :(

Moral of the story: don't run random low-level *-tool commands on your 
system if you don't know what they do or whether they are needed!

sage


On Thu, 4 Oct 2018, Goktug Yildirim wrote:

> This is ceph-object-store tool logs for OSD.0.
> 
> https://paste.ubuntu.com/p/jNwf4DC46H/
> 
> There is something wrong. But we are not sure if we can't use the tool or 
> there is something wrong with the OSD.
> 
> 
> > On 4 Oct 2018, at 06:17, Sage Weil  wrote:
> > 
> > On Thu, 4 Oct 2018, Goktug Yildirim wrote:
> >> This is our cluster state right now. I can reach rbd list and thats good! 
> >> Thanks a lot Sage!!!
> >> ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/
> > 
> > Progress!  Not out of the woods yet, though...
> > 
> >> As you can see we have 2 unfound pg since some of our OSDs can not start. 
> >> 58 OSD gives different errors.
> >> How can I fix these OSD's? If I remember correctly it should not be so 
> >> much trouble.
> >> 
> >> These are OSDs' failed logs.
> >> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >> https://paste.ubuntu.com/p/pkRdVjCH4D/
> > 
> > These are both failing in rocksdb code, with something like
> > 
> > Can't access /032949.sst: NotFound:
> > 
> > Can you check whether that .sst file actually exists?  Might be a 
> > weird path issue.
> > 
> >> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >> https://paste.ubuntu.com/p/xpJRK6YhRX/
> > 
> > These are failing in the rocksdb CheckConstency code.  Not sure what to 
> > make of that.
> > 
> >> https://paste.ubuntu.com/p/SY3576dNbJ/
> >> https://paste.ubuntu.com/p/smyT6Y976b/
> > 
> > These are failing in BlueStore code.  The ceph-bluestore-tool fsck may help 
> > here, can you give it a shot?
> > 
> > sage
> > 
> > 
> >> 
> >>> On 3 Oct 2018, at 21:37, Sage Weil  wrote:
> >>> 
> >>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>  I'm so sorry about that I missed "out" parameter. My bad..
>  This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> >>> 
> >>> Excellent, thanks.  That looks like it confirms the problem is that teh 
> >>> recovery tool didn't repopulate the creating pgs properly.
> >>> 
> >>> If you take that 30 byte file I sent earlier (as hex) and update the 
> >>> osdmap epoch to the latest on the mon, confirm it decodes and dumps 
> >>> properly, and then inject it on the 3 mons, that should get you past this 
> >>> hump (and hopefully back up!).
> >>> 
> >>> sage
> >>> 
> >>> 
>  
>  Sage Weil  şunları yazdı (3 Eki 2018 21:13):
>  
> > I bet the kvstore output it in a hexdump format?  There is another 
> > option to get the raw data iirc
> > 
> > 
> > 
> >> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM 
> >>  wrote:
> >> I changed the file name to make it clear.
> >> When I use your command with "+decode"  I'm getting an error like this:
> >> 
> >> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
> >> error: buffer::malformed_input: void 
> >> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer 
> >> understand old encoding version 2 < 111
> >> 
> >> My ceph version: 13.2.2
> >> 
> >> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  
> >> şunu yazdı:
> >>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>  If I didn't do it wrong, I got the output as below.
>  
>  ceph-kvstore-tool rocksdb 
>  /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating 
>  creating > dump
>  2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column 
>  families: [default]
>  
>  ceph-dencoder type creating_pgs_t import dump dump_json
> >>> 
> >>> Sorry, should be
> >>> 
> >>> ceph-dencoder type creating_pgs_t import dump decode dump_json
> >>> 
> >>> s
> >>> 
>  {
>    "last_scan_epoch": 0,
>    "creating_pgs": [],
>    "queue": [],
>    "created_pools": []
>  }
>  
>  You can find the "dump" link below.
>  
>  dump: 
>  https://drive.google.com/f

Re: [ceph-users] Erasure coding with more chunks than servers

2018-10-05 Thread Paul Emmerich
Oh, and you'll need to use m>=3 to ensure availability during a node failure.


Paul
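For reference, the approach in the blog Caspar links below boils down to an EC profile with an OSD failure domain plus a CRUSH rule that limits how many chunks land on one host. A rough sketch (profile/pool names, PG counts and the chunks-per-host value are placeholders, and per the note above you would want m>=3 rather than m=2 for real node-failure tolerance):

ceph osd erasure-code-profile set ec52 k=5 m=2 crush-failure-domain=osd
ceph osd pool create ecpool 128 128 erasure ec52

and then adjust the pool's CRUSH rule so it places at most two chunks per host, e.g.:

        step take default
        step choose indep 0 type host
        step chooseleaf indep 2 type osd
        step emit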
On Fri, 5 Oct 2018 at 11:22, Caspar Smit wrote:
>
> Hi Vlad,
>
> You can check this blog: 
> http://cephnotes.ksperis.com/blog/2017/01/27/erasure-code-on-small-clusters
>
> Note! Be aware that these settings do not automatically cover a node failure.
>
> Check out this thread why:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024423.html
>
> Kind regards,
> Caspar
>
>
> On Thu, 4 Oct 2018 at 20:27, Vladimir Brik wrote:
>>
>> Hello
>>
>> I have a 5-server cluster and I am wondering if it's possible to create
>> pool that uses k=5 m=2 erasure code. In my experiments, I ended up with
>> pools whose pgs are stuck in creating+incomplete state even when I
>> created the erasure code profile with --crush-failure-domain=osd.
>>
>> Assuming that what I want to do is possible, will CRUSH distribute
>> chunks evenly among servers, so that if I need to bring one server down
>> (e.g. reboot), clients' ability to write or read any object would not be
>> disrupted? (I guess something would need to ensure that no server holds
>> more than two chunks of an object)
>>
>> Thanks,
>>
>> Vlad
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best handling network maintenance

2018-10-05 Thread Darius Kasparavičius
Hello,


I would have risked the nodown option for this short downtime. We had a
similar experience when we updated a bonded switch and had to reboot it.
Some of the connections dropped and the whole cluster started marking some
osds as down. Due to this almost all osds were marked as down, but none
of the processes stopped. When we rebooted the next switch, we used
nodown and there were no flapping osds at all that time. Best of luck
next time.
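For reference, the flag in question is set and cleared like the other cluster flags:

ceph osd set nodown
... maintenance ...
ceph osd unset nodown

With nodown set, OSDs will not be marked down even if they really fail, so it should only stay in place for the duration of the maintenance window.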


On Fri, Oct 5, 2018 at 4:03 PM Martin Palma  wrote:
>
> Thank you all for the clarification and suggestion.
>
> Here is a small experience report what happened during the network
> maintenance, maybe it is useful for others too:
>
> As previously written the Ceph cluster is stretched across two data
> centers and has a size of 39 storage nodes with a total of 525 OSDs
> and 5 monitor nodes.
>
> The problem: Due to a network maintenance the connection between the
> two data center will be down for approximately 8-15 seconds, which
> will affect the Ceph cluster's public network.
>
> Before the maintenance we set the following flags:
> "noout, nobackfill, norebalance, noscrub, nodeep-scrub"
>
> During maintenance: The network between the two data centers was down
> for a total of 12 seconds. At first, it seemed everything worked fine.
> Some OSDs were marked as down but came back quickly and the monitor nodes
> started a new election. But then more and more OSDs were wrongly marked as
> down; in reality, their process was up and they had network
> connectivity.  Moreover, two monitors of one data center couldn't join
> the quorum anymore. The network team quickly figured out what the
> problem was: a wrong MTU size. After they fixed it the two monitor
> nodes rejoined the quorum and nearly all OSDs came up again. Only 36
> OSDs remained down, and checking them revealed they were really
> down. After a total time of 40 minutes, the cluster reached a healthy
> state again. No data loss.
>
> Best,
> Martin
> On Thu, Oct 4, 2018 at 11:09 AM Paul Emmerich  wrote:
> >
> > Mons are also on a 30s timeout.
> > Even a short loss of quorum isn‘t noticeable for ongoing IO.
> >
> > Paul
> >
> > > On 04.10.2018 at 11:03, Martin Palma wrote:
> > >
> > > Also monitor election? That is the most fear we have since the monitor
> > > nodes will no see each other for that timespan...
> > >> On Thu, Oct 4, 2018 at 10:21 AM Paul Emmerich  
> > >> wrote:
> > >>
> > >> 10 seconds is far below any relevant timeout values (generally 20-30 
> > >> seconds); so you will be fine without any special configuration.
> > >>
> > >> Paul
> > >>
> > >> On 04.10.2018 at 09:38, Konstantin Shalygin wrote:
> > >>
> >  What can we do of best handling this scenario to have minimal or no
> >  impact on Ceph?
> > 
> >  We plan to set "noout", "nobackfill", "norebalance", "noscrub",
> >  "nodeep",  "scrub" are there any other suggestions?
> > >>>
> > >>> ceph osd set noout
> > >>>
> > >>> ceph osd pause
> > >>>
> > >>>
> > >>>
> > >>> k
> > >>>
> > >>> ___
> > >>> ceph-users mailing list
> > >>> ceph-users@lists.ceph.com
> > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent directory content in cephfs

2018-10-05 Thread Sergey Malinin
Are you sure these mounts (work/06 and work/6c) refer to the same directory?

> On 5.10.2018, at 13:57, Burkhard Linke 
>  wrote:
> 
> root@host2:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
...
> root@host3:~# ls /ceph/sge-tmp/db/work/6c | wc -l

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best handling network maintenance

2018-10-05 Thread Martin Palma
Thank you all for the clarification and suggestion.

Here is a small experience report what happened during the network
maintenance, maybe it is useful for others too:

As previously written the Ceph cluster is stretched across two data
centers and has a size of 39 storage nodes with a total of 525 OSDs
and 5 monitor nodes.

The problem: Due to network maintenance the connection between the
two data centers will be down for approximately 8-15 seconds, which
will affect the Ceph cluster's public network.

Before the maintenance we set the following flags:
"noout, nobackfill, norebalance, noscrub, nodeep-scrub"

During maintenance: The network between the two data centers was down
for a total of 12 seconds. At first, it seemed everything worked fine.
Some OSDs were marked as down but came back quickly and the monitor nodes
started a new election. But then more and more OSDs were wrongly marked as
down; in reality, their process was up and they had network
connectivity.  Moreover, two monitors of one data center couldn't join
the quorum anymore. The network team quickly figured out what the
problem was: a wrong MTU size. After they fixed it the two monitor
nodes rejoined the quorum and nearly all OSDs came up again. Only 36
OSDs remained down, and checking them revealed they were really
down. After a total time of 40 minutes, the cluster reached a healthy
state again. No data loss.

Best,
Martin
On Thu, Oct 4, 2018 at 11:09 AM Paul Emmerich  wrote:
>
> Mons are also on a 30s timeout.
> Even a short loss of quorum isn‘t noticeable for ongoing IO.
>
> Paul
>
> > On 04.10.2018 at 11:03, Martin Palma wrote:
> >
> > Also monitor election? That is the most fear we have since the monitor
> > nodes will no see each other for that timespan...
> >> On Thu, Oct 4, 2018 at 10:21 AM Paul Emmerich  
> >> wrote:
> >>
> >> 10 seconds is far below any relevant timeout values (generally 20-30 
> >> seconds); so you will be fine without any special configuration.
> >>
> >> Paul
> >>
> >> On 04.10.2018 at 09:38, Konstantin Shalygin wrote:
> >>
>  What can we do of best handling this scenario to have minimal or no
>  impact on Ceph?
> 
>  We plan to set "noout", "nobackfill", "norebalance", "noscrub",
>  "nodeep",  "scrub" are there any other suggestions?
> >>>
> >>> ceph osd set noout
> >>>
> >>> ceph osd pause
> >>>
> >>>
> >>>
> >>> k
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent directory content in cephfs

2018-10-05 Thread Paul Emmerich
Try running a scrub on that directory, that might yield more information.

ceph daemon mds.XXX scrub_path /path/in/cephfs recursive

Afterwards you can maybe try to repair it if it finds the error. Could
also be something completely different like a bug in the clients.

Paul
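(If the scrub does report damage, the repair pass is essentially the same command with the repair flag added, e.g. "ceph daemon mds.XXX scrub_path /path/in/cephfs recursive repair"; the exact flag syntax may vary slightly between releases.)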
On Fri, 5 Oct 2018 at 12:57, Burkhard Linke wrote:
>
> Hi,
>
>
> a user just stumbled across a problem with directory content in cephfs
> (kernel client, ceph 12.2.8, one active, one standby-replay instance):
>
>
> root@host1:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
> 224
> root@host1:~# uname -a
> Linux host1 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@host2:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
> 224
> root@host2:~# uname -a
> Linux host2 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@host3:~# ls /ceph/sge-tmp/db/work/6c | wc -l
> 225
> root@host3:~# uname -a
> Linux host3 4.13.0-19-generic #22~16.04.1-Ubuntu SMP Mon Dec 4 15:35:18
> UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>
>
> Three hosts, different kernel versions, and one extra directory entry on
> the third host. All hosts used the same mount configuration:
>
> # mount | grep ceph
> :/volumes on /ceph type ceph
> (rw,relatime,name=volumes,secret=,acl,readdir_max_entries=8192,readdir_max_bytes=4104304)
>
> MDS logs only contain '2018-10-05 12:43:55.565598 7f2b7c578700  1
> mds.ceph-storage-04 Updating MDS map to version 325550 from mon.0' about
> every few minutes, with increasing version numbers. ceph -w also shows
> the following warnings:
>
> 2018-10-05 12:25:06.955085 mon.ceph-storage-03 [WRN] Health check
> failed: 2 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
> 2018-10-05 12:26:18.895358 mon.ceph-storage-03 [INF] MDS health message
> cleared (mds.0): Client host1:volumes failing to respond to cache pressure
> 2018-10-05 12:26:18.895401 mon.ceph-storage-03 [INF] MDS health message
> cleared (mds.0): Client cb-pc10:volumes failing to respond to cache pressure
> 2018-10-05 12:26:19.415890 mon.ceph-storage-03 [INF] Health check
> cleared: MDS_CLIENT_RECALL (was: 2 clients failing to respond to cache
> pressure)
> 2018-10-05 12:26:19.415919 mon.ceph-storage-03 [INF] Cluster is now healthy
>
> Timestamps of the MDS log messages and the messages about cache pressure
> are equal, so I assume that the MDS map has a list of failing clients
> and thus gets updated.
>
>
> But this does not explain the difference in the directory content. All
> entries are subdirectories. I also tried to enforce renewal of cached
> information by drop the kernel caches on the affected host, but to no
> avail yet. Caps on the MDS have dropped from 3.2 million to 800k, so
> dropping was effective.
>
>
> Any hints on the root cause for this problem? I've also tested various
> other clients... some show 224 entries, some 225.
>
>
> Regards,
>
> Burkhard
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Inconsistent directory content in cephfs

2018-10-05 Thread Burkhard Linke

Hi,


a user just stumbled across a problem with directory content in cephfs 
(kernel client, ceph 12.2.8, one active, one standby-replay instance):



root@host1:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
224
root@host1:~# uname -a
Linux host1 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43 
UTC 2018 x86_64 x86_64 x86_64 GNU/Linux



root@host2:~# ls /ceph/sge-tmp/db/work/06/ | wc -l
224
root@host2:~# uname -a
Linux host2 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34 
UTC 2018 x86_64 x86_64 x86_64 GNU/Linux



root@host3:~# ls /ceph/sge-tmp/db/work/6c | wc -l
225
root@host3:~# uname -a
Linux host3 4.13.0-19-generic #22~16.04.1-Ubuntu SMP Mon Dec 4 15:35:18 
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux



Three hosts, different kernel versions, and one extra directory entry on 
the third host. All hosts used the same mount configuration:


# mount | grep ceph
:/volumes on /ceph type ceph 
(rw,relatime,name=volumes,secret=,acl,readdir_max_entries=8192,readdir_max_bytes=4104304)


MDS logs only contain '2018-10-05 12:43:55.565598 7f2b7c578700  1 
mds.ceph-storage-04 Updating MDS map to version 325550 from mon.0' about 
every few minutes, with increasing version numbers. ceph -w also shows 
the following warnings:


2018-10-05 12:25:06.955085 mon.ceph-storage-03 [WRN] Health check 
failed: 2 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-10-05 12:26:18.895358 mon.ceph-storage-03 [INF] MDS health message 
cleared (mds.0): Client host1:volumes failing to respond to cache pressure
2018-10-05 12:26:18.895401 mon.ceph-storage-03 [INF] MDS health message 
cleared (mds.0): Client cb-pc10:volumes failing to respond to cache pressure
2018-10-05 12:26:19.415890 mon.ceph-storage-03 [INF] Health check 
cleared: MDS_CLIENT_RECALL (was: 2 clients failing to respond to cache 
pressure)

2018-10-05 12:26:19.415919 mon.ceph-storage-03 [INF] Cluster is now healthy

Timestamps of the MDS log messages and the messages about cache pressure 
are equal, so I assume that the MDS map has a list of failing clients 
and thus gets updated.



But this does not explain the difference in the directory content. All 
entries are subdirectories. I also tried to enforce renewal of cached 
information by dropping the kernel caches on the affected host, but to no 
avail yet. Caps on the MDS have dropped from 3.2 million to 800k, so 
dropping was effective.



Any hints on the root cause for this problem? I've also tested various 
other clients... some show 224 entries, some 225.



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Invalid bucket in reshard list

2018-10-05 Thread Alexandru Cucu
Hello,

I'm running a Luminous 12.2.7 cluster.

I wanted to reshard the index of an RGW bucket and accidentally typed
the name wrong.
Now in "radosgw-admin reshard list" I have a task for a bucket that
does not exist.

Can't process or cancel it:
# radosgw-admin reshard process
ERROR: failed to process reshard logs, error=(16) Device or resource busy

# radosgw-admin reshard cancel --bucket='restore-test'
could not get bucket info for bucket=restore-test
ERROR: could not init bucket: (2) No such file or directory

Had to create a bucket with that name, process the resharding task and
then delete the bucket.

Couldn't find an issue on http://tracker.ceph.com/ and can't open a new one.
Can someone help me with access or by opening a new issue?

Thanks,
Alex Cucu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot write to cephfs if some osd's are not available on the client network

2018-10-05 Thread Marc Roos
 

I guess then this waiting "quietly" should be looked at again, I am 
having a load of 10 on this vm.

[@~]# uptime
 11:51:58 up 4 days,  1:35,  1 user,  load average: 10.00, 10.01, 10.05

[@~]# uname -a
Linux smb 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux

[@~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

[@~]# dmesg
[348948.927734] libceph: osd23 192.168.10.114:6810 socket closed (con 
state CONNECTING)
[348957.120090] libceph: osd27 192.168.10.114:6802 socket closed (con 
state CONNECTING)
[349010.370171] libceph: osd26 192.168.10.114:6806 socket closed (con 
state CONNECTING)
[349114.822301] libceph: osd24 192.168.10.114:6804 socket closed (con 
state CONNECTING)
[349141.447330] libceph: osd29 192.168.10.114:6812 socket closed (con 
state CONNECTING)
[349278.668658] libceph: osd25 192.168.10.114:6800 socket closed (con 
state CONNECTING)
[349440.467038] libceph: osd28 192.168.10.114:6808 socket closed (con 
state CONNECTING)
[349465.043957] libceph: osd23 192.168.10.114:6810 socket closed (con 
state CONNECTING)
[349473.236400] libceph: osd27 192.168.10.114:6802 socket closed (con 
state CONNECTING)
[349526.486408] libceph: osd26 192.168.10.114:6806 socket closed (con 
state CONNECTING)
[349630.938498] libceph: osd24 192.168.10.114:6804 socket closed (con 
state CONNECTING)
[349657.563561] libceph: osd29 192.168.10.114:6812 socket closed (con 
state CONNECTING)
[349794.784936] libceph: osd25 192.168.10.114:6800 socket closed (con 
state CONNECTING)
[349956.583300] libceph: osd28 192.168.10.114:6808 socket closed (con 
state CONNECTING)
[349981.160225] libceph: osd23 192.168.10.114:6810 socket closed (con 
state CONNECTING)
[349989.352510] libceph: osd27 192.168.10.114:6802 socket closed (con 
state CONNECTING)
..
..
..




-Original Message-
From: John Spray [mailto:jsp...@redhat.com] 
Sent: donderdag 27 september 2018 11:43
To: Marc Roos
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cannot write to cephfs if some osd's are not 
available on the client network

On Thu, Sep 27, 2018 at 10:16 AM Marc Roos  
wrote:
>
>
> I have a test cluster and on an osd node I put a vm. The vm is using a 
> macvtap on the client network interface of the osd node. Making access 

> to local osd's impossible.
>
> the vm of course reports that it cannot access the local osd's. What I 

> am getting is:
>
> - I cannot reboot this vm normally, need to reset it.

When linux tries to shut down cleanly, part of that is flushing buffers 
from any mounted filesystem back to disk.  If you have a network 
filesystem mounted, and the network is unavailable, that can cause the 
process to block.  You can try forcibly unmounting before rebooting.

> - vm is reporting very high load.

The CPU load part is surprising -- in general Ceph clients should wait 
quietly when blocked, rather than spinning.

> I guess this should not be happening, no? Because it should choose 
> another available osd of the 3x replicated pool and just write the data 
> to that one?

No -- writes always go through the primary OSD for the PG being written 
to.  If an OSD goes down, then another OSD will become the primary.  In 
your case, the primary OSD is not going down, it's just being cut off 
from the client by the network, so the writes are blocking indefinitely.

John
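(For reference, which OSD is currently primary for a given object can be checked with "ceph osd map <pool> <object-name>"; it prints the PG and its up/acting sets, with the primary indicated.)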

>
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding with more chunks than servers

2018-10-05 Thread Caspar Smit
Hi Vlad,

You can check this blog:
http://cephnotes.ksperis.com/blog/2017/01/27/erasure-code-on-small-clusters

Note! Be aware that these settings do not automatically cover a node
failure.

Check out this thread why:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024423.html

Kind regards,
Caspar


On Thu, 4 Oct 2018 at 20:27, Vladimir Brik <vladimir.b...@icecube.wisc.edu> wrote:

> Hello
>
> I have a 5-server cluster and I am wondering if it's possible to create
> pool that uses k=5 m=2 erasure code. In my experiments, I ended up with
> pools whose pgs are stuck in creating+incomplete state even when I
> created the erasure code profile with --crush-failure-domain=osd.
>
> Assuming that what I want to do is possible, will CRUSH distribute
> chunks evenly among servers, so that if I need to bring one server down
> (e.g. reboot), clients' ability to write or read any object would not be
> disrupted? (I guess something would need to ensure that no server holds
> more than two chunks of an object)
>
> Thanks,
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds_cache_memory_limit value

2018-10-05 Thread Eugen Block

Hi,

you can monitor the cache size and see if the new values are applied:

ceph@mds:~> ceph daemon mds. cache status
{
"pool": {
"items": 106708834,
"bytes": 5828227058
}
}

You should also see in top (or similar tools) that the memory  
increases/decreases. From my experience the new config value is  
applied immediately.


Regards,
Eugen
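For reference, the value a running MDS is actually using can also be queried directly over the same admin socket (the daemon name is a placeholder):

ceph daemon mds.<name> config get mds_cache_memory_limit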


Zitat von Hervé Ballans :


Hi all,

I have just configured a new value for 'mds_cache_memory_limit'. The  
output message says "not observed, change may require restart".
So I'm not really sure: has the new value been taken into account  
directly, or do I have to restart the mds daemons on each MDS node?


$ sudo ceph tell mds.* injectargs '--mds_cache_memory_limit 17179869184';
2018-10-04 16:25:11.692131 7f3012ffd700  0 client.2226325  
ms_handle_reset on IP1:6804/2649460488
2018-10-04 16:25:11.714746 7f3013fff700  0 client.4154799  
ms_handle_reset on IP1:6804/2649460488
mds.mon1: mds_cache_memory_limit = '17179869184' (not observed,  
change may require restart)
2018-10-04 16:25:11.725028 7f3012ffd700  0 client.4154802  
ms_handle_reset on IP0:6805/997393445
2018-10-04 16:25:11.748790 7f3013fff700  0 client.4154805  
ms_handle_reset on IP0:6805/997393445
mds.mon0: mds_cache_memory_limit = '17179869184' (not observed,  
change may require restart)
2018-10-04 16:25:11.760127 7f3012ffd700  0 client.2226334  
ms_handle_reset on IP2:6801/2590484227
2018-10-04 16:25:11.787951 7f3013fff700  0 client.2226337  
ms_handle_reset on IP2:6801/2590484227
mds.mon2: mds_cache_memory_limit = '17179869184' (not observed,  
change may require restart)


Thanks,
Hervé




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds_cache_memory_limit value

2018-10-05 Thread Hervé Ballans

Hi all,

I have just configured a new value for 'mds_cache_memory_limit'. The 
output message says "not observed, change may require restart".
So I'm not really sure: has the new value been taken into account 
directly, or do I have to restart the mds daemons on each MDS node?


$ sudo ceph tell mds.* injectargs '--mds_cache_memory_limit 17179869184';
2018-10-04 16:25:11.692131 7f3012ffd700  0 client.2226325 
ms_handle_reset on IP1:6804/2649460488
2018-10-04 16:25:11.714746 7f3013fff700  0 client.4154799 
ms_handle_reset on IP1:6804/2649460488
mds.mon1: mds_cache_memory_limit = '17179869184' (not observed, change 
may require restart)
2018-10-04 16:25:11.725028 7f3012ffd700  0 client.4154802 
ms_handle_reset on IP0:6805/997393445
2018-10-04 16:25:11.748790 7f3013fff700  0 client.4154805 
ms_handle_reset on IP0:6805/997393445
mds.mon0: mds_cache_memory_limit = '17179869184' (not observed, change 
may require restart)
2018-10-04 16:25:11.760127 7f3012ffd700  0 client.2226334 
ms_handle_reset on IP2:6801/2590484227
2018-10-04 16:25:11.787951 7f3013fff700  0 client.2226337 
ms_handle_reset on IP2:6801/2590484227
mds.mon2: mds_cache_memory_limit = '17179869184' (not observed, change 
may require restart)


Thanks,
Hervé

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com