[ceph-users] Bluestore zetascale vs rocksdb

2017-02-13 Thread Deepak Naidu
Folks,

Has anyone been using BlueStore with CephFS? If so, did you test 
ZetaScale vs RocksDB? Any install steps/best practices would be appreciated.

PS: I still see that BlueStore is an "experimental feature". Is there any timeline for 
when it will be GA/stable?

--
Deepak


---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph server with errors while deployment -- on jewel

2017-02-13 Thread frank

Hi,


We have a minimal Ceph cluster setup: 1 admin node, 1 mon node and 2 OSDs. 
We use CentOS 7 as the OS on all servers.


Currently, while deploying the servers, we receive the errors below.

=

[root@admin-ceph ~]# ceph health detail
2017-02-13 16:14:49.652786 7f6b8c6b6700  0 -- :/2855134392 >> 
10.10.48.8:6789/0 pipe(0x7f6b88063e90 sd=3 :0 s=1 pgs=0 cs=0 l=1 
c=0x7f6b8805c500).fault
2017-02-13 16:14:52.651750 7f6b8c5b5700  0 -- :/2855134392 >> 
10.10.48.8:6789/0 pipe(0x7f6b7c000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 
c=0x7f6b7c001f90).fault
2017-02-13 16:14:55.652046 7f6b8c6b6700  0 -- :/2855134392 >> 
10.10.48.8:6789/0 pipe(0x7f6b7c0052b0 sd=4 :0 s=1 pgs=0 cs=0 l=1 
c=0x7f6b7c006570).fault




Is there any insight on what might be wrong? Also, this is a project for 
our CloudStack storage. Has there been any successful integration of 
Ceph with CloudStack? Please let me know the details of the Ceph 
installation steps I should follow to troubleshoot this issue.



Regards,

Frank

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After upgrading from 0.94.9 to Jewel 10.2.5 on Ubuntu 14.04 OSDs fail to start with a crash dump

2017-02-13 Thread Brad Hubbard
Capture a log with debug_osd at 30 (yes, that's correct, 30) and see
if that sheds more light on the issue.
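
Since the OSD dies during init, injectargs won't work; a minimal way to do it
(assuming the default cluster name and Ubuntu 14.04's upstart jobs) is roughly:

  # /etc/ceph/ceph.conf on the affected host
  [osd.271]
      debug osd = 30

  # then try to start it again and grab the resulting log
  sudo start ceph-osd id=271
  less /var/log/ceph/ceph-osd.271.log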

On Tue, Feb 14, 2017 at 6:53 AM, Alfredo Colangelo
 wrote:
> Hi Ceph experts,
>
> after updating from ceph 0.94.9 to ceph 10.2.5 on Ubuntu 14.04, 2 out of 3
> osd processes are unable to start. On another machine the same happened but
> only on 1 out of 3 OSDs.
>
> The update procedure is done via ceph-deploy 1.5.37.
>
> Shouldn’t be a permissions problem, because before updating I do a chown
> 64045: 64045 on the osd disks /dev/sd[bcd] and on the (separate) journal
> partition on ssd /dev/sda[678]
>
> When upgrade procedure is completed the 3 ceph osd processes are still
> running, but if I restart them some of them refuses to start.
>
>
>
> The error in /var/log/ceph/ceph-osd.271.log is full of errors like this :
>
>
>
> 2017-02-13 09:47:17.590843 7fc57248f800  0 set uid:gid to 1001:1001
> (ceph:ceph)
>
> 2017-02-13 09:47:17.590859 7fc57248f800  0 ceph version 10.2.5
> (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
>
> 2017-02-13 09:47:17.591356 7fc57248f800  0 pidfile_write: ignore empty
> --pid-file
>
> 2017-02-13 09:47:17.601186 7fc57248f800  0
> filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
>
> 2017-02-13 09:47:17.601530 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP
> ioctl is disabled via 'filestore fiemap' config option
>
> 2017-02-13 09:47:17.601539 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features:
> SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
>
> 2017-02-13 09:47:17.601553 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice
> is supported
>
> 2017-02-13 09:47:17.613611 7fc57248f800  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features:
> syncfs(2) syscall fully supported (by glibc and kernel)
>
> 2017-02-13 09:47:17.613673 7fc57248f800  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is
> disabled by conf
>
> 2017-02-13 09:47:17.614454 7fc57248f800  1 leveldb: Recovering log #6754
>
> 2017-02-13 09:47:17.672544 7fc57248f800  1 leveldb: Delete type=3 #6753
>
>
>
> 2017-02-13 09:47:17.672662 7fc57248f800  1 leveldb: Delete type=0 #6754
>
>
>
> 2017-02-13 09:47:17.673640 7fc57248f800  0
> filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal
> mode: checkpoint is not enabled
>
> 2017-02-13 09:47:17.684464 7fc57248f800  0  cls/hello/cls_hello.cc:305:
> loading cls_hello
>
> 2017-02-13 09:47:17.688815 7fc57248f800  0 
> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>
> 2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef
> OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13
> 09:47:17.692735
>
> osd/OSD.h: 885: FAILED assert(ret)
>
>
>
>  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0x55ea51744dab]
>
>  2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
>
>  3: (OSD::init()+0x1ed2) [0x55ea51103872]
>
>  4: (main()+0x29d1) [0x55ea5106ae41]
>
>  5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
>
>  6: (()+0x355b17) [0x55ea510b3b17]
>
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
>
>
> --- begin dump of recent events ---
>
>-29> 2017-02-13 09:47:17.587145 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perfcounters_dump hook 0x55ea5d1d8050
>
>-28> 2017-02-13 09:47:17.587164 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command 1 hook 0x55ea5d1d8050
>
>-27> 2017-02-13 09:47:17.587166 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perf dump hook 0x55ea5d1d8050
>
>-26> 2017-02-13 09:47:17.587168 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perfcounters_schema hook 0x55ea5d1d8050
>
>-25> 2017-02-13 09:47:17.587170 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command 2 hook 0x55ea5d1d8050
>
>-24> 2017-02-13 09:47:17.587172 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perf schema hook 0x55ea5d1d8050
>
>-23> 2017-02-13 09:47:17.587174 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command perf reset hook 0x55ea5d1d8050
>
>-22> 2017-02-13 09:47:17.587176 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command config show hook 0x55ea5d1d8050
>
>-21> 2017-02-13 09:47:17.587178 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command config set hook 0x55ea5d1d8050
>
>-20> 2017-02-13 09:47:17.587181 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command config get hook 0x55ea5d1d8050
>
>-19> 2017-02-13 09:47:17.587187 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command config diff hook 0x55ea5d1d8050
>
>-18> 2017-02-13 09:47:17.587189 7fc57248f800  5 asok(0x55ea5d1f8280)
> register_command log flush hook 0x55ea5d1d8050
>
>-17> 2017-02-13 09:47

Re: [ceph-users] PG stuck peering after host reboot

2017-02-13 Thread Brad Hubbard
I'd suggest creating a tracker and uploading a full debug log from the
primary so we can look at this in more detail.
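
To get that log, something along these lines on the primary (osd.595) should
be enough - and if injectargs hangs there like before, set the same options
in ceph.conf and restart the OSD instead:

  ceph tell osd.595 injectargs '--debug-osd 20 --debug-ms 1'
  # let it sit in peering for a while, then revert to the defaults
  ceph tell osd.595 injectargs '--debug-osd 0/5 --debug-ms 0/5'
  ceph-post-file /var/log/ceph/ceph-osd.595.log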

On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
> Hi Brad,
>
> I could not tell you that as `ceph pg 1.323 query` never completes, it just 
> hangs there.
>
> On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
>
> On Thu, Feb 9, 2017 at 3:36 AM,   wrote:
> > Hi Corentin,
> >
> > I've tried that, the primary hangs when trying to injectargs so I set 
> the option in the config file and restarted all OSDs in the PG, it came up 
> with:
> >
> > pg 1.323 is remapped+peering, acting 
> [595,1391,2147483647,127,937,362,267,320,7,634,716]
> >
> > Still can't query the PG, no error messages in the logs of osd.240.
> > The logs on osd.595 and osd.7 still fill up with the same messages.
>
> So what does "peering_blocked_by_detail" show in that case since it
> can no longer show "peering_blocked_by_history_les_bound"?
>
> >
> > Regards,
> >
> > George
> > 
> > From: Corentin Bonneton [l...@titin.fr]
> > Sent: 08 February 2017 16:31
> > To: Vasilakakos, George (STFC,RAL,SC)
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] PG stuck peering after host reboot
> >
> > Hello,
> >
> > I already had the case, I applied the parameter 
> (osd_find_best_info_ignore_history_les) to all the osd that have reported the 
> queries blocked.
> >
> > --
> > Cordialement,
> > CEO FEELB | Corentin BONNETON
> > cont...@feelb.io
> >
> > Le 8 févr. 2017 à 17:17, 
> george.vasilaka...@stfc.ac.uk a écrit :
> >
> > Hi Ceph folks,
> >
> > I have a cluster running Jewel 10.2.5 using a mix EC and replicated 
> pools.
> >
> > After rebooting a host last night, one PG refuses to complete peering
> >
> > pg 1.323 is stuck inactive for 73352.498493, current state peering, 
> last acting [595,1391,240,127,937,362,267,320,7,634,716]
> >
> > Restarting OSDs or hosts does nothing to help, or sometimes results in 
> things like this:
> >
> > pg 1.323 is remapped+peering, acting 
> [2147483647,1391,240,127,937,362,267,320,7,634,716]
> >
> >
> > The host that was rebooted is home to osd.7 (8). If I go onto it to 
> look at the logs for osd.7 this is what I see:
> >
> > $ tail -f /var/log/ceph/ceph-osd.7.log
> > 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- 
> XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 
> sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating 
> reconnect
> >
> > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates 
> the direction of communication. I've traced these to osd.7 (rank 8 in the 
> stuck PG) reaching out to osd.595 (the primary in the stuck PG).
> >
> > Meanwhile, looking at the logs of osd.595 I see this:
> >
> > $ tail -f /var/log/ceph/ceph-osd.595.log
> > 2017-02-08 15:41:15.760708 7f1765673700  0 -- 
> XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 
> sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs 
> existing 477 state standby
> > 2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 
> != exp 3786596716
> >
> > which again shows osd.595 reaching out to osd.7 and from what I could 
> gather the CRC problem is about messaging.
> >
> > Google searching has yielded nothing particularly useful on how to get 
> this unstuck.
> >
> > ceph pg 1.323 query seems to hang forever but it completed once last 
> night and I noticed this:
> >
> >"peering_blocked_by_detail": [
> >{
> >"detail": "peering_blocked_by_history_les_bound"
> >}
> >
> > We have seen this before and it was cleared by setting 
> osd_find_best_info_ignore_history_les to true for the first two OSDs on the 
> stuck PGs (this was on a 3 replica pool). This hasn't worked in this case and 
> I suspect the option needs to be set on either a majority of OSDs or enough k 
> number of OSDs to be able to use their data and ignore history.
> >
> > We would really appreciate any guidance and/or help the community can 
> offer!
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
>
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-13 Thread Brad Hubbard
Could one of the reporters open a tracker for this issue and attach
the requested debugging data?
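
For reference, a rough way to collect that data (a handful of gstack samples
plus the per-thread ps view, as described below - adjust the sample count and
interval as needed):

  for i in $(seq 1 5); do gstack $(pidof ceph-mgr) >> mgr-stacks.txt; sleep 2; done
  ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan > mgr-threads.txt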

On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis  wrote:
> I am having the same issue. When I looked at my idle cluster this morning,
> one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of that.  I
> have 3 AIO nodes, and only one of them seemed to be affected.
>
> On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard  wrote:
>>
>> Want to install debuginfo packages and use something like this to try
>> and find out where it is spending most of its time?
>>
>> https://poormansprofiler.org/
>>
>> Note that you may need to do multiple runs to get a "feel" for where
>> it is spending most of its time. Also not that likely only one or two
>> threads will be using the CPU (you can see this in ps output using a
>> command like the following) the rest will likely be idle or waiting
>> for something.
>>
>> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>>
>> Observation of these two and maybe a couple of manual gstack dumps
>> like this to compare thread ids to ps output (LWP is the thread id
>> (tid) in gdb output) should give us some idea of where it is spinning.
>>
>> # gstack $(pidof ceph-mgr)
>>
>>
>> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>>  wrote:
>> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS
>> > 7 w/
>> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has
>> > allocated ~11GB of RAM after a single day of usage. Only the active
>> > manager
>> > is performing this way. The growth is linear and reproducible.
>> >
>> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB
>> > OSDs
>> > each.
>> >
>> >
>> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
>> >
>> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
>> >
>> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,
>> > 0.0
>> > st
>> >
>> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
>> > buff/cache
>> >
>> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772 avail
>> > Mem
>> >
>> >
>> >   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
>> > COMMAND
>> >
>> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27
>> > ceph-mgr
>> >
>> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50
>> > ceph-mon
>> >
>> >
>> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
>> >  wrote:
>> >>
>> >> John,
>> >>
>> >> This morning I compared the logs from yesterday and I show a noticeable
>> >> increase in messages like these:
>> >>
>> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all mon_status
>> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all health
>> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all pg_summary
>> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
>> >> mgrdigest v1
>> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all mon_status
>> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all health
>> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all pg_summary
>> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
>> >> mgrdigest v1
>> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >>
>> >>
>> >> In a 1 minute period yesterday I saw 84 times this group of messages
>> >> showed up.  Today that same group of messages showed up 156 times.
>> >>
>> >> Other than that I did see an increase in this messages from 9 times a
>> >> minute to 14 times a minute:
>> >>
>> >> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104
>> >> >> -
>> >> conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
>> >> l=0).fault with nothing to send and in the half  accept state just
>> >> closed
>> >>
>> >> Let me know if you need anything else.
>> >>
>> >> Bryan
>> >>
>> >>
>> >> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
>> >> > >> bryan.stillw...@charter.com> wrote:
>> >>
>> >> >On 1/10/17, 5:35 AM, "John Spray"  wrote:
>> >> >
>> >> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> >> >> wrote:
>> >> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on
>> >> >>> a
>> >> >>> single node, two OSD cluster, and after a while I noticed that the
>> >> >>> new
>> >> >>> ceph-mgr daemon is freq

[ceph-users] After upgrading from 0.94.9 to Jewel 10.2.5 on Ubuntu 14.04 OSDs fail to start with a crash dump

2017-02-13 Thread Alfredo Colangelo
Hi Ceph experts,

after updating from ceph 0.94.9 to ceph 10.2.5 on Ubuntu 14.04, 2 out of 3 osd 
processes are unable to start. On another machine the same happened but only on 
1 out of 3 OSDs.

The update procedure is done via ceph-deploy 1.5.37.

It shouldn't be a permissions problem, because before updating I do a chown 
64045:64045 on the OSD disks /dev/sd[bcd] and on the (separate) journal partitions on 
SSD /dev/sda[678].

When the upgrade procedure is completed, the 3 ceph-osd processes are still running, 
but if I restart them, some of them refuse to start.

 

The log /var/log/ceph/ceph-osd.271.log is full of errors like this:

 

2017-02-13 09:47:17.590843 7fc57248f800  0 set uid:gid to 1001:1001 (ceph:ceph)

2017-02-13 09:47:17.590859 7fc57248f800  0 ceph version 10.2.5 
(c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128

2017-02-13 09:47:17.591356 7fc57248f800  0 pidfile_write: ignore empty 
--pid-file

2017-02-13 09:47:17.601186 7fc57248f800  0 
filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)

2017-02-13 09:47:17.601530 7fc57248f800  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP 
ioctl is disabled via 'filestore fiemap' config option

2017-02-13 09:47:17.601539 7fc57248f800  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: 
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option

2017-02-13 09:47:17.601553 7fc57248f800  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice is 
supported

2017-02-13 09:47:17.613611 7fc57248f800  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: syncfs(2) 
syscall fully supported (by glibc and kernel)

2017-02-13 09:47:17.613673 7fc57248f800  0 
xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is 
disabled by conf

2017-02-13 09:47:17.614454 7fc57248f800  1 leveldb: Recovering log #6754

2017-02-13 09:47:17.672544 7fc57248f800  1 leveldb: Delete type=3 #6753

 

2017-02-13 09:47:17.672662 7fc57248f800  1 leveldb: Delete type=0 #6754

 

2017-02-13 09:47:17.673640 7fc57248f800  0 
filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal mode: 
checkpoint is not enabled

2017-02-13 09:47:17.684464 7fc57248f800  0  cls/hello/cls_hello.cc:305: 
loading cls_hello

2017-02-13 09:47:17.688815 7fc57248f800  0  cls/cephfs/cls_cephfs.cc:202: 
loading cephfs_size_scan

2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13 
09:47:17.692735

osd/OSD.h: 885: FAILED assert(ret)

 

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
[0x55ea51744dab]

 2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]

 3: (OSD::init()+0x1ed2) [0x55ea51103872]

 4: (main()+0x29d1) [0x55ea5106ae41]

 5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]

 6: (()+0x355b17) [0x55ea510b3b17]

 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

 

--- begin dump of recent events ---

   -29> 2017-02-13 09:47:17.587145 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command perfcounters_dump hook 0x55ea5d1d8050

   -28> 2017-02-13 09:47:17.587164 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command 1 hook 0x55ea5d1d8050

   -27> 2017-02-13 09:47:17.587166 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command perf dump hook 0x55ea5d1d8050

   -26> 2017-02-13 09:47:17.587168 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command perfcounters_schema hook 0x55ea5d1d8050

   -25> 2017-02-13 09:47:17.587170 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command 2 hook 0x55ea5d1d8050

   -24> 2017-02-13 09:47:17.587172 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command perf schema hook 0x55ea5d1d8050

   -23> 2017-02-13 09:47:17.587174 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command perf reset hook 0x55ea5d1d8050

   -22> 2017-02-13 09:47:17.587176 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command config show hook 0x55ea5d1d8050

   -21> 2017-02-13 09:47:17.587178 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command config set hook 0x55ea5d1d8050

   -20> 2017-02-13 09:47:17.587181 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command config get hook 0x55ea5d1d8050

   -19> 2017-02-13 09:47:17.587187 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command config diff hook 0x55ea5d1d8050

   -18> 2017-02-13 09:47:17.587189 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command log flush hook 0x55ea5d1d8050

   -17> 2017-02-13 09:47:17.587191 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command log dump hook 0x55ea5d1d8050

   -16> 2017-02-13 09:47:17.587195 7fc57248f800  5 asok(0x55ea5d1f8280) 
register_command log reopen hook 0x55ea5d1d8050

   -15> 2017-02-13 09:47:17.590843 7fc57248f800  0 set uid:gid to 1001:1001 
(ceph:ceph)

   -14> 2017-02-13 09:47:17.590859 7fc57

[ceph-users] radosgw 100-continue problem

2017-02-13 Thread Z Will
Hi,
I am using nginx + FastCGI + radosgw, with radosgw configured with "rgw
print continue = true". RFC 2616 says that an origin server that
sends a 100 (Continue) response MUST ultimately send a final status
code, once the request body is received and processed, unless it
terminates the transport connection prematurely. But in radosgw, for
PUT requests, when radosgw receives "Expect: 100-continue" it responds
with 100 Continue; the code looks like this:
int RGWFCGX::send_status(int status, const char *status_name)
{
  status_num = status;
  return print("Status: %d %s\r\n", status, status_name);
}
int RGWFCGX::send_100_continue()
{
  int r = send_status(100, "Continue");
  if (r >= 0) {
flush();
  }
  return r;
}

There is just one \r\n, which means the header section is not terminated. So when
it gets to send_response(), it continues sending the remaining headers. I wonder
if this is the cause; nginx doesn't work because of this. I think this
should be split into two responses.
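
One way to watch the handshake from the client side is a verbose curl PUT
(just a sketch - the endpoint is a placeholder and auth is left out, so adapt
it to your setup):

  curl -v -T bigfile.bin -H 'Expect: 100-continue' http://rgw.example.com/bucket/object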
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: 答复: mon is stuck in leveldb and costs nearly 100% cpu

2017-02-13 Thread Shinobu Kinjo
> 2 active+clean+scrubbing+deep

 * Set noscrub and nodeep-scrub
  # ceph osd set noscrub
  # ceph osd set nodeep-scrub

 * Wait for scrubbing+deep to complete

 * Do `ceph -s`

If you are still seeing high CPU usage, please identify which
process(es) are eating the CPU.

 * ps aux | sort -rk 3,4 | head -n 20

And let us know.
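
To see where the MON store itself stands while you do that (default paths and
cluster name "ceph" assumed):

 * du -sh /var/lib/ceph/mon/ceph-`hostname -s`/store.db
 * top -Hp `pidof ceph-mon`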


On Mon, Feb 13, 2017 at 9:39 PM, Chenyehua  wrote:
> Thanks for the response, Shinobu
> The warning disappears due to your suggesting solution, however the nearly 
> 100% cpu cost still exists and concerns me a lot.
> So, do you know why the cpu cost is so high?
> Are there any solutions or suggestions to this problem?
>
> Cheers
>
> -邮件原件-
> 发件人: Shinobu Kinjo [mailto:ski...@redhat.com]
> 发送时间: 2017年2月13日 10:54
> 收件人: chenyehua 11692 (RD)
> 抄送: kc...@redhat.com; ceph-users@lists.ceph.com
> 主题: Re: 答复: [ceph-users] mon is stuck in leveldb and costs nearly 100% cpu
>
> O.k, that's reasonable answer. Would you do on all hosts which the MON are 
> running on:
>
>  #* ceph --admin-daemon /var/run/ceph/ceph-mon.`hostname -s`.asok config show 
> | grep leveldb_log
>
> Anyway you can compact leveldb size with at runtime:
>
>  #* ceph tell mon.`hostname -s` compact
>
> And you should set in ceph.conf to prevent same issue from the next:
>
>  #* [mon]
>  #* mon compact on start = true
>
>
> On Mon, Feb 13, 2017 at 11:37 AM, Chenyehua  wrote:
>> Sorry, I made a mistake, the ceph version is actually 0.94.5
>>
>> -邮件原件-
>> 发件人: chenyehua 11692 (RD)
>> 发送时间: 2017年2月13日 9:40
>> 收件人: 'Shinobu Kinjo'
>> 抄送: kc...@redhat.com; ceph-users@lists.ceph.com
>> 主题: 答复: [ceph-users] mon is stuck in leveldb and costs nearly 100% cpu
>>
>> My ceph version is 10.2.5
>>
>> -邮件原件-
>> 发件人: Shinobu Kinjo [mailto:ski...@redhat.com]
>> 发送时间: 2017年2月12日 13:12
>> 收件人: chenyehua 11692 (RD)
>> 抄送: kc...@redhat.com; ceph-users@lists.ceph.com
>> 主题: Re: [ceph-users] mon is stuck in leveldb and costs nearly 100% cpu
>>
>> Which Ceph version are you using?
>>
>> On Sat, Feb 11, 2017 at 5:02 PM, Chenyehua  wrote:
>>> Dear Mr Kefu Chai
>>>
>>> Sorry to disturb you.
>>>
>>> I meet a problem recently. In my ceph cluster ,health status has
>>> warning “store is getting too big!” for several days; and  ceph-mon
>>> costs nearly 100% cpu;
>>>
>>> Have you ever met this situation?
>>>
>>> Some detailed information are attached below:
>>>
>>>
>>>
>>> root@cvknode17:~# ceph -s
>>>
>>> cluster 04afba60-3a77-496c-b616-2ecb5e47e141
>>>
>>>  health HEALTH_WARN
>>>
>>> mon.cvknode17 store is getting too big! 34104 MB >= 15360
>>> MB
>>>
>>>  monmap e1: 3 mons at
>>> {cvknode15=172.16.51.15:6789/0,cvknode16=172.16.51.16:6789/0,cvknode1
>>> 7
>>> =172.16.51.17:6789/0}
>>>
>>> election epoch 862, quorum 0,1,2
>>> cvknode15,cvknode16,cvknode17
>>>
>>>  osdmap e196279: 347 osds: 347 up, 347 in
>>>
>>>   pgmap v5891025: 33272 pgs, 16 pools, 26944 GB data, 6822
>>> kobjects
>>>
>>> 65966 GB used, 579 TB / 644 TB avail
>>>
>>>33270 active+clean
>>>
>>>2 active+clean+scrubbing+deep
>>>
>>>   client io 840 kB/s rd, 739 kB/s wr, 35 op/s rd, 184 op/s wr
>>>
>>>
>>>
>>> root@cvknode17:~# top
>>>
>>> top - 15:19:28 up 23 days, 23:58,  6 users,  load average: 1.08,
>>> 1.40,
>>> 1.77
>>>
>>> Tasks: 346 total,   2 running, 342 sleeping,   0 stopped,   2 zombie
>>>
>>> Cpu(s):  8.1%us, 10.8%sy,  0.0%ni, 69.0%id,  9.5%wa,  0.0%hi,
>>> 2.5%si, 0.0%st
>>>
>>> Mem:  65384424k total, 58102880k used,  7281544k free,   240720k buffers
>>>
>>> Swap: 2100k total,   344944k used, 29654156k free, 24274272k cached
>>>
>>>
>>>
>>> PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>>>
>>>   24407 root  20   0 17.3g  12g  10m S   98 20.2   8420:11 ceph-mon
>>>
>>>
>>>
>>> root@cvknode17:~# top -Hp 24407
>>>
>>> top - 15:19:49 up 23 days, 23:59,  6 users,  load average: 1.12,
>>> 1.39,
>>> 1.76
>>>
>>> Tasks:  17 total,   1 running,  16 sleeping,   0 stopped,   0 zombie
>>>
>>> Cpu(s):  8.1%us, 10.8%sy,  0.0%ni, 69.0%id,  9.5%wa,  0.0%hi,
>>> 2.5%si, 0.0%st
>>>
>>> Mem:  65384424k total, 58104868k used,  7279556k free,   240744k buffers
>>>
>>> Swap: 2100k total,   344944k used, 29654156k free, 24271188k cached
>>>
>>>
>>>
>>> PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>>>
>>>   25931 root  20   0 17.3g  12g   9m R   98 20.2   7957:37 ceph-mon
>>>
>>>   24514 root  20   0 17.3g  12g   9m S2 20.2   3:06.75 ceph-mon
>>>
>>>   25932 root  20   0 17.3g  12g   9m S2 20.2   1:07.82 ceph-mon
>>>
>>>   24407 root  20   0 17.3g  12g   9m S0 20.2   0:00.67 ceph-mon
>>>
>>>   24508 root  20   0 17.3g  12g   9m S0 20.2  15:50.24 ceph-mon
>>>
>>>   24513 root  20   0 17.3g  12g   9m S0 20.2   0:07.88 ceph-mon
>>>
>>>   24534 root  20   0 17.3g  12g   9m S0 20.2 196:33.85 ceph-mon
>>>
>>>   24535 root  20   0 17.3g  12g   9m S0 20.2   0:00.01 ceph-mon

Re: [ceph-users] SMR disks go 100% busy after ~15 minutes

2017-02-13 Thread Bernhard J . M . Grün
Hi Wido,

No, I did not set special flags - I've used ceph-deploy without further
parameters apart from the journal disk/partition that these OSDs should use.
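
In other words, the invocations were roughly of this form (host and device
names are just placeholders here):

  ceph-deploy osd create --fs-type btrfs osdhost01:/dev/sdb:/dev/sdk1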

Bernhard

Wido den Hollander  schrieb am Mo., 13. Feb. 2017 um
17:47 Uhr:

>
> > Op 13 februari 2017 om 16:49 schreef "Bernhard J. M. Grün" <
> bernhard.gr...@gmail.com>:
> >
> >
> > Hi,
> >
> > we are using SMR disks for backup purposes in our Ceph cluster.
> > We have had massive problems with those disks prior to upgrading to
> Kernel
> > 4.9.x. We also dropped XFS as filesystem and we now use btrfs (only for
> > those disks).
> > Since we did this we don't have such problems anymore.
> >
>
> We have kernel 4.9 there, but XFS is not SMR-aware so it doesn't help.
>
> I saw posts that some XFS work is on it's way, but it's not being actively
> developed. What I saw however is that you need to issue some flags on mkfs.
>
> Did you need to do that when formatting btrfs on the SMR disks?
>
> Wido
>
> > If you don't like btrfs you could try to use a journal disk for XFS
> itself
> > and also a journal disk for Ceph. I assume this will also solve many
> > problems as the XFS journal is rewritten often and SMR disks don't like
> > rewrites.
> > I think that is one reason why btrfs works smoother with those disks.
> >
> > Hope this helps
> >
> > Bernhard
> >
> > Wido den Hollander  schrieb am Mo., 13. Feb. 2017 um
> > 16:11 Uhr:
> >
> > >
> > > > Op 13 februari 2017 om 15:57 schreef Peter Maloney <
> > > peter.malo...@brockmann-consult.de>:
> > > >
> > > >
> > > > Then you're not aware of what the SMR disks do. They are just slow
> for
> > > > all writes, having to read the tracks around, then write it all again
> > > > instead of just the one thing you really wanted to write, due to
> > > > overlap. Then to partially mitigate this, they have some tiny write
> > > > buffer like 8GB flash, and then they use that for the "normal" speed,
> > > > and then when it's full, you crawl (at least this is what the seagate
> > > > ones do). Journals aren't designed to solve that... they help prevent
> > > > the sync load on the osd, but don't somehow make the throughput
> higher
> > > > (at least not sustained). Even if the journal was perfectly designed
> for
> > > > performance, it would still do absolutely nothing if it's full and
> the
> > > > disk is still busy with the old flushing.
> > > >
> > >
> > > Well, that explains indeed. I wasn't aware of the additional buffer
> inside
> > > a SMR disk.
> > >
> > > I was asked to look at this system for somebody who bought SMR disks
> > > without knowing. As I never touch these disks I found the behavior odd.
> > >
> > > The buffer explains it a lot better, wasn't aware that SMR disks have
> that.
> > >
> > > SMR shouldn't be used in Ceph without proper support in Bluestore or
> XFS
> > > aware SMR.
> > >
> > > Wido
> > >
> > > >
> > > > On 02/13/17 15:49, Wido den Hollander wrote:
> > > > > Hi,
> > > > >
> > > > > I have a odd case with SMR disks in a Ceph cluster. Before I
> continue,
> > > yes, I am fully aware of SMR and Ceph not playing along well, but
> there is
> > > something happening which I'm not able to fully explain.
> > > > >
> > > > > On a 2x replica cluster with 8TB Seagate SMR disks I can write with
> > > about 30MB/sec to each disk using a simple RADOS bench:
> > > > >
> > > > > $ rados bench -t 1
> > > > > $ time rados put 1GB.bin
> > > > >
> > > > > Both ways I found out that the disk can write at that rate.
> > > > >
> > > > > Now, when I start a benchmark with 32 threads it writes fine. Not
> > > super fast, but it works.
> > > > >
> > > > > After 15 minutes or so various disks go to 100% busy and just stay
> > > there. These OSDs are being marked as down and some even commit
> suicide due
> > > to threads timing out.
> > > > >
> > > > > Stopping the RADOS bench and starting the OSDs again resolves the
> > > situation.
> > > > >
> > > > > I am trying to explain what's happening. I'm aware that SMR isn't
> very
> > > good at Random Writes. To partially overcome this there are Intel DC
> 3510s
> > > in there as Journal SSDs.
> > > > >
> > > > > Can anybody explain why this 100% busy pops up after 15 minutes or
> so?
> > > > >
> > > > > Obviously it would the best if BlueStore had SMR support, but for
> now
> > > it's just Filestore with XFS on there.
> > > > >
> > > > > Wido
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > --
> > Freundliche Grüße
> >
> > Bernhard J. M. Grün, Pü

Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread Piotr Dzionek
OK, the Partition GUID code was the same as the Partition unique GUID. I used 
"sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal' 
--partition-guid=1:$journal_uuid --typecode=1:$journal_uuid --mbrtogpt 
-- /dev/sdk" to recreate my journal. However, the typecode part should be 
45B0969E-9B03-4F30-B4C6-B4B80CEFF106, not the journal_uuid. I guess 
this tutorial is for old Ceph, which ran as root rather than as the ceph 
user. Thanks for your help.
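
For reference, the corrected invocation (size and device taken from my setup,
with $journal_uuid a freshly generated partition GUID):

  sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal' \
       --partition-guid=1:$journal_uuid \
       --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 \
       --mbrtogpt -- /dev/sdk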


Kind regards,
Piotr Dzionek

On 13.02.2017 at 16:38, ulem...@polarzone.de wrote:

Hi Piotr,
is your partition GUID right?

Look with sgdisk:
# sgdisk --info=2 /dev/sdd
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 396A0C50-738C-449E-9FC6-B2D3A4469E51
First sector: 2048 (at 1024.0 KiB)
Last sector: 10485760 (at 5.0 GiB)
Partition size: 10483713 sectors (5.0 GiB)
Attribute flags: 
Partition name: 'ceph journal'

# sgdisk --info=2 /dev/sdc
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 31E9A040-A2C2-4F8F-906E-19D8A24DBDAB
First sector: 2048 (at 1024.0 KiB)
Last sector: 10485760 (at 5.0 GiB)
Partition size: 10483713 sectors (5.0 GiB)
Attribute flags: 
Partition name: 'ceph journal'



Udo

On 2017-02-13 16:13, Piotr Dzionek wrote:

I run it on CentOS Linux release 7.3.1611. After running "udevadm test
/sys/block/sda/sda1" I don't see that this rule apply to this disk.

Hmm I remember that it used to work properly, but some time ago I
retested journal disk recreation. I followed the same tutorial like
the one pasted here by Wido den Hollander :

"The udev rules of Ceph should chown the journal to ceph:ceph if it's
set to the right partition UUID.
This blog shows it
partially:http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/"; 



I think that my journals were not recreated in a proper way, but I
don't know what is missing.

My SSD journal disk looks like this:
/Disk /dev/sda: 120GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End SizeFile system  Name Flags
 1  1049kB  27.3GB  27.3GB   ceph journal
 2  27.3GB  54.6GB  27.3GB   ceph journal
 3  54.6GB  81.9GB  27.3GB   ceph journal
 4  81.9GB  109GB   27.3GB   ceph journal/

and blkid:
blkid | grep sda
/dev/sda1: PARTLABEL="ceph journal"
PARTUUID="a5ea6883-b2b2-4d53-b8ba-9ff8bcddead5"
/dev/sda2: PARTLABEL="ceph journal"
PARTUUID="adae4442-380c-418c-bdc0-05890fcf633e"
/dev/sda3: PARTLABEL="ceph journal"
PARTUUID="a8637452-fd9c-4d68-924f-69a43c75442c"
/dev/sda4: PARTLABEL="ceph journal"
PARTUUID="615a208a-19e0-4e02-8ef3-19d618a71103"

Do you have any idea what may be wrong?

On 13.02.2017 at 12:45, Craig Chi wrote:

Hi,
What is your OS? The permission of journal partition should be 
changed by udev rules: /lib/udev/rules.d/95-ceph-osd.rules

In this file, it is described as:
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
You can also use udevadm command to test whether the partition has 
been processed by the correct udev rule. Like following:

#> udevadm test /sys/block/sdb/sdb2
...
starting 'probe-bcache -o udev /dev/sdb2'
Process 'probe-bcache -o udev /dev/sdb2' succeeded.
OWNER 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
GROUP 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
MODE 0660 /lib/udev/rules.d/95-ceph-osd.rules:16
RUN '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name' 
/lib/udev/rules.d/95-ceph-osd.rules:16

...
Then /dev/sdb2 will have ceph:ceph permission automatically.
#> ls -l /dev/sdb2
brw-rw 1 ceph ceph 8, 18 Feb 13 19:43 /dev/sdb2
Sincerely,
Craig Chi
On 2017-02-13 19:06, Piotr Dzionek  wrote:

Hi,

I am running ceph Jewel 10.2.5 with separate journals - ssd disks.
It runs pretty smooth, however I stumble upon an issue after
system reboot. Journal disks become owned by root and ceph failed
to start.

/starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4
/var/lib/ceph/osd/ceph-4/journal//
/ /2017-02-10 16:24:29.924126 7fd07ab40800 -1
filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal
/var/lib/ceph/osd/ceph-4/journal: (13) Permission denied//
/ /2017-02-10 16:24:29.924210 7fd07ab40800 -1 osd.4 0 OSD:init:
unable to mount object store//
/ /2017-02-10 16:24:29.924217 7fd07ab40800 -1 #033[0;31m ** ERROR:
osd init failed: (13) Permission denied#033[0m/

I fixed this issue by finding journal disks in /dev dir and chown
to ceph:ceph. I remember that I had a similar issue after I
installed it for a first time. Is it a bug ? or do I have to set
some kind of udev rules for this disks?

FYI, I have this issue after every rest

Re: [ceph-users] SMR disks go 100% busy after ~15 minutes

2017-02-13 Thread Wido den Hollander

> Op 13 februari 2017 om 16:49 schreef "Bernhard J. M. Grün" 
> :
> 
> 
> Hi,
> 
> we are using SMR disks for backup purposes in our Ceph cluster.
> We have had massive problems with those disks prior to upgrading to Kernel
> 4.9.x. We also dropped XFS as filesystem and we now use btrfs (only for
> those disks).
> Since we did this we don't have such problems anymore.
> 

We have kernel 4.9 there, but XFS is not SMR-aware so it doesn't help.

I saw posts that some XFS work is on its way, but it's not being actively 
developed. What I saw, however, is that you need to pass some flags at mkfs time.

Did you need to do that when formatting btrfs on the SMR disks?

Wido

> If you don't like btrfs you could try to use a journal disk for XFS itself
> and also a journal disk for Ceph. I assume this will also solve many
> problems as the XFS journal is rewritten often and SMR disks don't like
> rewrites.
> I think that is one reason why btrfs works smoother with those disks.
> 
> Hope this helps
> 
> Bernhard
> 
> Wido den Hollander  schrieb am Mo., 13. Feb. 2017 um
> 16:11 Uhr:
> 
> >
> > > Op 13 februari 2017 om 15:57 schreef Peter Maloney <
> > peter.malo...@brockmann-consult.de>:
> > >
> > >
> > > Then you're not aware of what the SMR disks do. They are just slow for
> > > all writes, having to read the tracks around, then write it all again
> > > instead of just the one thing you really wanted to write, due to
> > > overlap. Then to partially mitigate this, they have some tiny write
> > > buffer like 8GB flash, and then they use that for the "normal" speed,
> > > and then when it's full, you crawl (at least this is what the seagate
> > > ones do). Journals aren't designed to solve that... they help prevent
> > > the sync load on the osd, but don't somehow make the throughput higher
> > > (at least not sustained). Even if the journal was perfectly designed for
> > > performance, it would still do absolutely nothing if it's full and the
> > > disk is still busy with the old flushing.
> > >
> >
> > Well, that explains indeed. I wasn't aware of the additional buffer inside
> > a SMR disk.
> >
> > I was asked to look at this system for somebody who bought SMR disks
> > without knowing. As I never touch these disks I found the behavior odd.
> >
> > The buffer explains it a lot better, wasn't aware that SMR disks have that.
> >
> > SMR shouldn't be used in Ceph without proper support in Bluestore or XFS
> > aware SMR.
> >
> > Wido
> >
> > >
> > > On 02/13/17 15:49, Wido den Hollander wrote:
> > > > Hi,
> > > >
> > > > I have a odd case with SMR disks in a Ceph cluster. Before I continue,
> > yes, I am fully aware of SMR and Ceph not playing along well, but there is
> > something happening which I'm not able to fully explain.
> > > >
> > > > On a 2x replica cluster with 8TB Seagate SMR disks I can write with
> > about 30MB/sec to each disk using a simple RADOS bench:
> > > >
> > > > $ rados bench -t 1
> > > > $ time rados put 1GB.bin
> > > >
> > > > Both ways I found out that the disk can write at that rate.
> > > >
> > > > Now, when I start a benchmark with 32 threads it writes fine. Not
> > super fast, but it works.
> > > >
> > > > After 15 minutes or so various disks go to 100% busy and just stay
> > there. These OSDs are being marked as down and some even commit suicide due
> > to threads timing out.
> > > >
> > > > Stopping the RADOS bench and starting the OSDs again resolves the
> > situation.
> > > >
> > > > I am trying to explain what's happening. I'm aware that SMR isn't very
> > good at Random Writes. To partially overcome this there are Intel DC 3510s
> > in there as Journal SSDs.
> > > >
> > > > Can anybody explain why this 100% busy pops up after 15 minutes or so?
> > > >
> > > > Obviously it would the best if BlueStore had SMR support, but for now
> > it's just Filestore with XFS on there.
> > > >
> > > > Wido
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> -- 
> Freundliche Grüße
> 
> Bernhard J. M. Grün, Püttlingen, Deutschland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread Wido den Hollander

> Op 13 februari 2017 om 16:38 schreef ulem...@polarzone.de:
> 
> 
> Hi Piotr,
> is your partition GUID right?
> 
> Look with sgdisk:
> # sgdisk --info=2 /dev/sdd
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
> Partition unique GUID: 396A0C50-738C-449E-9FC6-B2D3A4469E51
> First sector: 2048 (at 1024.0 KiB)
> Last sector: 10485760 (at 5.0 GiB)
> Partition size: 10483713 sectors (5.0 GiB)
> Attribute flags: 
> Partition name: 'ceph journal'
> 
> # sgdisk --info=2 /dev/sdc
> Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
> Partition unique GUID: 31E9A040-A2C2-4F8F-906E-19D8A24DBDAB
> First sector: 2048 (at 1024.0 KiB)
> Last sector: 10485760 (at 5.0 GiB)
> Partition size: 10483713 sectors (5.0 GiB)
> Attribute flags: 
> Partition name: 'ceph journal'
> 

For the record, that's done by this UDEV rule:

# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER="ceph", GROUP="ceph", MODE="660"

Wido

> 
> 
> Udo
> 
> Am 2017-02-13 16:13, schrieb Piotr Dzionek:
> > I run it on CentOS Linux release 7.3.1611. After running "udevadm test
> > /sys/block/sda/sda1" I don't see that this rule apply to this disk.
> > 
> > Hmm I remember that it used to work properly, but some time ago I
> > retested journal disk recreation. I followed the same tutorial like
> > the one pasted here by Wido den Hollander :
> > 
> > "The udev rules of Ceph should chown the journal to ceph:ceph if it's
> > set to the right partition UUID.
> > This blog shows it
> > partially:http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/";
> > 
> > I think that my journals were not recreated in a proper way, but I
> > don't know what is missing.
> > 
> > My SSD journal disk looks like this:
> > /Disk /dev/sda: 120GB
> > Sector size (logical/physical): 512B/512B
> > Partition Table: gpt
> > Disk Flags:
> > 
> > Number  Start   End SizeFile system  Name Flags
> >  1  1049kB  27.3GB  27.3GB   ceph journal
> >  2  27.3GB  54.6GB  27.3GB   ceph journal
> >  3  54.6GB  81.9GB  27.3GB   ceph journal
> >  4  81.9GB  109GB   27.3GB   ceph journal/
> > 
> > and blkid:
> > blkid | grep sda
> > /dev/sda1: PARTLABEL="ceph journal"
> > PARTUUID="a5ea6883-b2b2-4d53-b8ba-9ff8bcddead5"
> > /dev/sda2: PARTLABEL="ceph journal"
> > PARTUUID="adae4442-380c-418c-bdc0-05890fcf633e"
> > /dev/sda3: PARTLABEL="ceph journal"
> > PARTUUID="a8637452-fd9c-4d68-924f-69a43c75442c"
> > /dev/sda4: PARTLABEL="ceph journal"
> > PARTUUID="615a208a-19e0-4e02-8ef3-19d618a71103"
> > 
> > Do you have any idea what may be wrong?
> > 
> > W dniu 13.02.2017 o 12:45, Craig Chi pisze:
> >> Hi,
> >> What is your OS? The permission of journal partition should be changed 
> >> by udev rules: /lib/udev/rules.d/95-ceph-osd.rules
> >> In this file, it is described as:
> >> # JOURNAL_UUID
> >> ACTION=="add", SUBSYSTEM=="block", \
> >>   ENV{DEVTYPE}=="partition", \
> >> ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
> >>   OWNER:="ceph", GROUP:="ceph", MODE:="660", \
> >>   RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
> >> You can also use udevadm command to test whether the partition has 
> >> been processed by the correct udev rule. Like following:
> >> #> udevadm test /sys/block/sdb/sdb2
> >> ...
> >> starting 'probe-bcache -o udev /dev/sdb2'
> >> Process 'probe-bcache -o udev /dev/sdb2' succeeded.
> >> OWNER 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
> >> GROUP 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
> >> MODE 0660 /lib/udev/rules.d/95-ceph-osd.rules:16
> >> RUN '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name' 
> >> /lib/udev/rules.d/95-ceph-osd.rules:16
> >> ...
> >> Then /dev/sdb2 will have ceph:ceph permission automatically.
> >> #> ls -l /dev/sdb2
> >> brw-rw 1 ceph ceph 8, 18 Feb 13 19:43 /dev/sdb2
> >> Sincerely,
> >> Craig Chi
> >> On 2017-02-13 19:06, Piotr Dzionek  wrote:
> >> 
> >> Hi,
> >> 
> >> I am running ceph Jewel 10.2.5 with separate journals - ssd disks.
> >> It runs pretty smooth, however I stumble upon an issue after
> >> system reboot. Journal disks become owned by root and ceph failed
> >> to start.
> >> 
> >> /starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4
> >> /var/lib/ceph/osd/ceph-4/journal//
> >> / /2017-02-10 16:24:29.924126 7fd07ab40800 -1
> >> filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal
> >> /var/lib/ceph/osd/ceph-4/journal: (13) Permission denied//
> >> / /2017-02-10 16:24:29.924210 7fd07ab40800 -1 osd.4 0 OSD:init:
> >> unable to mount object store//
> >> 

Re: [ceph-users] SMR disks go 100% busy after ~15 minutes

2017-02-13 Thread Bernhard J . M . Grün
Hi,

we are using SMR disks for backup purposes in our Ceph cluster.
We have had massive problems with those disks prior to upgrading to Kernel
4.9.x. We also dropped XFS as the filesystem and now use btrfs (only for
those disks).
Since we did this we don't have such problems anymore.

If you don't like btrfs you could try to use a journal disk for XFS itself
and also a journal disk for Ceph. I assume this will also solve many
problems as the XFS journal is rewritten often and SMR disks don't like
rewrites.
I think that is one reason why btrfs works smoother with those disks.

Hope this helps

Bernhard

Wido den Hollander  schrieb am Mo., 13. Feb. 2017 um
16:11 Uhr:

>
> > Op 13 februari 2017 om 15:57 schreef Peter Maloney <
> peter.malo...@brockmann-consult.de>:
> >
> >
> > Then you're not aware of what the SMR disks do. They are just slow for
> > all writes, having to read the tracks around, then write it all again
> > instead of just the one thing you really wanted to write, due to
> > overlap. Then to partially mitigate this, they have some tiny write
> > buffer like 8GB flash, and then they use that for the "normal" speed,
> > and then when it's full, you crawl (at least this is what the seagate
> > ones do). Journals aren't designed to solve that... they help prevent
> > the sync load on the osd, but don't somehow make the throughput higher
> > (at least not sustained). Even if the journal was perfectly designed for
> > performance, it would still do absolutely nothing if it's full and the
> > disk is still busy with the old flushing.
> >
>
> Well, that explains indeed. I wasn't aware of the additional buffer inside
> a SMR disk.
>
> I was asked to look at this system for somebody who bought SMR disks
> without knowing. As I never touch these disks I found the behavior odd.
>
> The buffer explains it a lot better, wasn't aware that SMR disks have that.
>
> SMR shouldn't be used in Ceph without proper support in Bluestore or XFS
> aware SMR.
>
> Wido
>
> >
> > On 02/13/17 15:49, Wido den Hollander wrote:
> > > Hi,
> > >
> > > I have a odd case with SMR disks in a Ceph cluster. Before I continue,
> yes, I am fully aware of SMR and Ceph not playing along well, but there is
> something happening which I'm not able to fully explain.
> > >
> > > On a 2x replica cluster with 8TB Seagate SMR disks I can write with
> about 30MB/sec to each disk using a simple RADOS bench:
> > >
> > > $ rados bench -t 1
> > > $ time rados put 1GB.bin
> > >
> > > Both ways I found out that the disk can write at that rate.
> > >
> > > Now, when I start a benchmark with 32 threads it writes fine. Not
> super fast, but it works.
> > >
> > > After 15 minutes or so various disks go to 100% busy and just stay
> there. These OSDs are being marked as down and some even commit suicide due
> to threads timing out.
> > >
> > > Stopping the RADOS bench and starting the OSDs again resolves the
> situation.
> > >
> > > I am trying to explain what's happening. I'm aware that SMR isn't very
> good at Random Writes. To partially overcome this there are Intel DC 3510s
> in there as Journal SSDs.
> > >
> > > Can anybody explain why this 100% busy pops up after 15 minutes or so?
> > >
> > > Obviously it would the best if BlueStore had SMR support, but for now
> it's just Filestore with XFS on there.
> > >
> > > Wido
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 
Freundliche Grüße

Bernhard J. M. Grün, Püttlingen, Deutschland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1 PG stuck unclean (active+remapped) after OSD replacement

2017-02-13 Thread Eugen Block

Thanks for your quick responses,

while I was writing my answer we had a rebalancing going on, because I  
started a new crush reweight to get rid of the old re-activated OSDs  
again, and now that it has finished, the cluster is back in a healthy state.
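
For completeness, the reweight was something like this for each of the old
OSDs (osd.3 as an example):

  ceph osd crush reweight osd.3 0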


Thanks,
Eugen

Quoting Gregory Farnum:


On Mon, Feb 13, 2017 at 7:05 AM Wido den Hollander  wrote:



> Op 13 februari 2017 om 16:03 schreef Eugen Block :
>
>
> Hi experts,
>
> I have a strange situation right now. We are re-organizing our 4 node
> Hammer cluster from LVM-based OSDs to HDDs. When we did this on the
> first node last week, everything went smoothly, I removed the OSDs
> from the crush map and the rebalancing and recovery finished
> successfully.
> This weekend we did the same with the second node, we created the
> HDD-based OSDs and added them to the cluster, waited for rebalancing
> to finish and then stopped the old OSDs. Only this time the recovery
> didn't completely finish, 4 PGs kept stuck unclean. I found out that 3
> of these 4 PGs had their primary OSD on that node. So I restarted the
> respective services and those 3 PGs recovered successfully. But there
> is one last PG that gives me headaches.
>
> ceph@ndesan01:~ # ceph pg map 1.3d3
> osdmap e24320 pg 1.3d3 (1.3d3) -> up [16,21] acting [16,21,0]
>

What version of Ceph? And could it be that the cluster has old CRUSH
tunables? When was it installed with which Ceph version?



I'm not sure it even takes old tunables. With half the weight in one bucket
(that last host) it's going to have trouble. Assuming things will balance
out when the transition is done, I'd just keep going, especially since the
three acting replicas are sticking around.
-Greg




Wido

> ceph@ndesan01:~/ceph-deploy> ceph osd tree
> ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 9.38985 root default
> -2 1.19995 host ndesan01
>   0 0.23999 osd.0  up  1.0  1.0
>   1 0.23999 osd.1  up  1.0  1.0
>   2 0.23999 osd.2  up  1.0  1.0
> 13 0.23999 osd.13 up  1.0  1.0
> 19 0.23999 osd.19 up  1.0  1.0
> -3 1.81998 host ndesan02
>   3   0 osd.3down0  1.0
>   4   0 osd.4down0  1.0
>   5   0 osd.5down0  1.0
>   9   0 osd.9down  1.0  1.0
> 10   0 osd.10   down  1.0  1.0
>   6 0.90999 osd.6  up  1.0  1.0
>   7 0.90999 osd.7  up  1.0  1.0
> -4 1.81998 host nde32
> 20 0.90999 osd.20 up  1.0  1.0
> 21 0.90999 osd.21 up  1.0  1.0
> -5 4.54994 host ndesan03
> 14 0.90999 osd.14 up  1.0  1.0
> 15 0.90999 osd.15 up  1.0  1.0
> 16 0.90999 osd.16 up  1.0  1.0
> 17 0.90999 osd.17 up  1.0  1.0
> 18 0.90999 osd.18 up  1.0  1.0
>
>
> All OSDs marked as "down" are going to be removed. I looked for that
> PG on all 3 nodes, and all of them have it. All services are up and
> running, but for some reason this PG is not aware of that. Is there
> any reasonable explanation and/or some advice how to get that PG
> recovered?
>
> One thing I noticed:
>
> The data on the primary OSD (osd.16) had different timestamps than on
> the other two OSDs:
>
> ---cut here---
> ndesan03:~ # ls -rtl /var/lib/ceph/osd/ceph-16/current/1.3d3_head/
> total 389436
> -rw-r--r-- 1 root root   0 Jul 12  2016 __head_03D3__1
> ...
> -rw-r--r-- 1 root root   0 Jan  9 10:43
> rbd\udata.bca465368d6b49.0a06__head_20EFF3D3__1
> -rw-r--r-- 1 root root   0 Jan  9 10:43
> rbd\udata.bca465368d6b49.0a8b__head_A014F3D3__1
> -rw-r--r-- 1 root root   0 Jan  9 10:44
> rbd\udata.bca465368d6b49.0e2c__head_00F2D3D3__1
> -rw-r--r-- 1 root root   0 Jan  9 10:44
> rbd\udata.bca465368d6b49.0e6a__head_C91813D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 13:53
> rbd\udata.cc94344e6afb66.08cb__head_6AA4B3D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 14:47
> rbd\udata.e15aee238e1f29.05f0__head_C95063D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 15:10
> rbd\udata.e15aee238e1f29.0d15__head_FF1083D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 15:19
> rbd\udata.e15aee238e1f29.100c__head_6B17F3D3__1
> -rw-r--r-- 1 root root 8388608 Jan 23 14:17
> rbd\udata.e73cf7b03e0c6.0479__head_C16003D3__1
> -rw-r--r-- 1 root root 8388608 Jan 25 11:52
> rbd\udata.d4edc95e884adc.00f4__head_00EE43D3__1
> -rw-r--r-- 1 root root 4194304 Jan 27 08:07
> rbd\udata.34595be2237e6.0ad5__head_D3CC93D3__1
> -rw-r--r-- 

Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread ulembke

Hi Piotr,
is your partition GUID right?

Look with sgdisk:
# sgdisk --info=2 /dev/sdd
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 396A0C50-738C-449E-9FC6-B2D3A4469E51
First sector: 2048 (at 1024.0 KiB)
Last sector: 10485760 (at 5.0 GiB)
Partition size: 10483713 sectors (5.0 GiB)
Attribute flags: 
Partition name: 'ceph journal'

# sgdisk --info=2 /dev/sdc
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 31E9A040-A2C2-4F8F-906E-19D8A24DBDAB
First sector: 2048 (at 1024.0 KiB)
Last sector: 10485760 (at 5.0 GiB)
Partition size: 10483713 sectors (5.0 GiB)
Attribute flags: 
Partition name: 'ceph journal'



Udo

On 2017-02-13 16:13, Piotr Dzionek wrote:

I run it on CentOS Linux release 7.3.1611. After running "udevadm test
/sys/block/sda/sda1" I don't see that this rule apply to this disk.

Hmm I remember that it used to work properly, but some time ago I
retested journal disk recreation. I followed the same tutorial like
the one pasted here by Wido den Hollander :

"The udev rules of Ceph should chown the journal to ceph:ceph if it's
set to the right partition UUID.
This blog shows it
partially:http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/";

I think that my journals were not recreated in a proper way, but I
don't know what is missing.

My SSD journal disk looks like this:
/Disk /dev/sda: 120GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End SizeFile system  Name Flags
 1  1049kB  27.3GB  27.3GB   ceph journal
 2  27.3GB  54.6GB  27.3GB   ceph journal
 3  54.6GB  81.9GB  27.3GB   ceph journal
 4  81.9GB  109GB   27.3GB   ceph journal

and blkid:
blkid | grep sda
/dev/sda1: PARTLABEL="ceph journal"
PARTUUID="a5ea6883-b2b2-4d53-b8ba-9ff8bcddead5"
/dev/sda2: PARTLABEL="ceph journal"
PARTUUID="adae4442-380c-418c-bdc0-05890fcf633e"
/dev/sda3: PARTLABEL="ceph journal"
PARTUUID="a8637452-fd9c-4d68-924f-69a43c75442c"
/dev/sda4: PARTLABEL="ceph journal"
PARTUUID="615a208a-19e0-4e02-8ef3-19d618a71103"

Do you have any idea what may be wrong?

W dniu 13.02.2017 o 12:45, Craig Chi pisze:

Hi,
What is your OS? The permission of journal partition should be changed 
by udev rules: /lib/udev/rules.d/95-ceph-osd.rules

In this file, it is described as:
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
You can also use udevadm command to test whether the partition has 
been processed by the correct udev rule. Like following:

#> udevadm test /sys/block/sdb/sdb2
...
starting 'probe-bcache -o udev /dev/sdb2'
Process 'probe-bcache -o udev /dev/sdb2' succeeded.
OWNER 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
GROUP 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
MODE 0660 /lib/udev/rules.d/95-ceph-osd.rules:16
RUN '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name' 
/lib/udev/rules.d/95-ceph-osd.rules:16

...
Then /dev/sdb2 will have ceph:ceph permission automatically.
#> ls -l /dev/sdb2
brw-rw 1 ceph ceph 8, 18 Feb 13 19:43 /dev/sdb2
Sincerely,
Craig Chi
On 2017-02-13 19:06, Piotr Dzionek  wrote:

Hi,

I am running ceph Jewel 10.2.5 with separate journals - ssd disks.
It runs pretty smooth, however I stumble upon an issue after
system reboot. Journal disks become owned by root and ceph failed
to start.

starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4
/var/lib/ceph/osd/ceph-4/journal
2017-02-10 16:24:29.924126 7fd07ab40800 -1
filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal
/var/lib/ceph/osd/ceph-4/journal: (13) Permission denied
2017-02-10 16:24:29.924210 7fd07ab40800 -1 osd.4 0 OSD:init:
unable to mount object store
2017-02-10 16:24:29.924217 7fd07ab40800 -1 #033[0;31m ** ERROR:
osd init failed: (13) Permission denied#033[0m

I fixed this issue by finding journal disks in /dev dir and chown
to ceph:ceph. I remember that I had a similar issue after I
installed it for a first time. Is it a bug ? or do I have to set
some kind of udev rules for this disks?

FYI, I have this issue after every restart now.

Kind regards,
Piotr Dzionek

  ___ ceph-users 
mailing

list ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread Piotr Dzionek
I run it on CentOS Linux release 7.3.1611. After running "udevadm test 
/sys/block/sda/sda1" I don't see that this rule applies to this disk.


Hmm, I remember that it used to work properly, but some time ago I 
retested journal disk recreation. I followed the same tutorial as the 
one pasted here by Wido den Hollander:


"The udev rules of Ceph should chown the journal to ceph:ceph if it's set to 
the right partition UUID.
This blog shows it 
partially:http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/";

I think that my journals were not recreated in a proper way, but I don't 
know what is missing.


My SSD journal disk looks like this:
Disk /dev/sda: 120GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End SizeFile system  Name Flags
 1  1049kB  27.3GB  27.3GB   ceph journal
 2  27.3GB  54.6GB  27.3GB   ceph journal
 3  54.6GB  81.9GB  27.3GB   ceph journal
 4  81.9GB  109GB   27.3GB   ceph journal

and blkid:
blkid | grep sda
/dev/sda1: PARTLABEL="ceph journal" 
PARTUUID="a5ea6883-b2b2-4d53-b8ba-9ff8bcddead5"
/dev/sda2: PARTLABEL="ceph journal" 
PARTUUID="adae4442-380c-418c-bdc0-05890fcf633e"
/dev/sda3: PARTLABEL="ceph journal" 
PARTUUID="a8637452-fd9c-4d68-924f-69a43c75442c"
/dev/sda4: PARTLABEL="ceph journal" 
PARTUUID="615a208a-19e0-4e02-8ef3-19d618a71103"


Do you have any idea what may be wrong?

W dniu 13.02.2017 o 12:45, Craig Chi pisze:

Hi,
What is your OS? The permission of journal partition should be changed 
by udev rules: /lib/udev/rules.d/95-ceph-osd.rules

In this file, it is described as:
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
You can also use udevadm command to test whether the partition has 
been processed by the correct udev rule. Like following:

#> udevadm test /sys/block/sdb/sdb2
...
starting 'probe-bcache -o udev /dev/sdb2'
Process 'probe-bcache -o udev /dev/sdb2' succeeded.
OWNER 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
GROUP 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
MODE 0660 /lib/udev/rules.d/95-ceph-osd.rules:16
RUN '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name' 
/lib/udev/rules.d/95-ceph-osd.rules:16

...
Then /dev/sdb2 will have ceph:ceph permission automatically.
#> ls -l /dev/sdb2
brw-rw 1 ceph ceph 8, 18 Feb 13 19:43 /dev/sdb2
Sincerely,
Craig Chi
On 2017-02-13 19:06, Piotr Dzionek  wrote:

Hi,

I am running ceph Jewel 10.2.5 with separate journals - ssd disks.
It runs pretty smooth, however I stumble upon an issue after
system reboot. Journal disks become owned by root and ceph failed
to start.

starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4
/var/lib/ceph/osd/ceph-4/journal
2017-02-10 16:24:29.924126 7fd07ab40800 -1
filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal
/var/lib/ceph/osd/ceph-4/journal: (13) Permission denied
2017-02-10 16:24:29.924210 7fd07ab40800 -1 osd.4 0 OSD:init:
unable to mount object store
2017-02-10 16:24:29.924217 7fd07ab40800 -1 #033[0;31m ** ERROR:
osd init failed: (13) Permission denied#033[0m

I fixed this issue by finding journal disks in /dev dir and chown
to ceph:ceph. I remember that I had a similar issue after I
installed it for a first time. Is it a bug ? or do I have to set
some kind of udev rules for this disks?

FYI, I have this issue after every restart now.

Kind regards,
Piotr Dzionek

  


___ ceph-users mailing
list ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1 PG stuck unclean (active+remapped) after OSD replacement

2017-02-13 Thread Gregory Farnum
On Mon, Feb 13, 2017 at 7:05 AM Wido den Hollander  wrote:

>
> > Op 13 februari 2017 om 16:03 schreef Eugen Block :
> >
> >
> > Hi experts,
> >
> > I have a strange situation right now. We are re-organizing our 4 node
> > Hammer cluster from LVM-based OSDs to HDDs. When we did this on the
> > first node last week, everything went smoothly, I removed the OSDs
> > from the crush map and the rebalancing and recovery finished
> > successfully.
> > This weekend we did the same with the second node, we created the
> > HDD-based OSDs and added them to the cluster, waited for rebalancing
> > to finish and then stopped the old OSDs. Only this time the recovery
> > didn't completely finish, 4 PGs kept stuck unclean. I found out that 3
> > of these 4 PGs had their primary OSD on that node. So I restarted the
> > respective services and those 3 PGs recovered successfully. But there
> > is one last PG that gives me headaches.
> >
> > ceph@ndesan01:~ # ceph pg map 1.3d3
> > osdmap e24320 pg 1.3d3 (1.3d3) -> up [16,21] acting [16,21,0]
> >
>
> What version of Ceph? And could it be that the cluster has old CRUSH
> tunables? When was it installed with which Ceph version?
>

I'm not sure it even takes old tunables. With half the weight in one bucket
(that last host) it's going to have trouble. Assuming things will balance
out when the transition is done, I'd just keep going, especially since the
three acting replicas are sticking around.
-Greg



> Wido
>
> > ceph@ndesan01:~/ceph-deploy> ceph osd tree
> > ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 9.38985 root default
> > -2 1.19995 host ndesan01
> >   0 0.23999 osd.0  up  1.0  1.0
> >   1 0.23999 osd.1  up  1.0  1.0
> >   2 0.23999 osd.2  up  1.0  1.0
> > 13 0.23999 osd.13 up  1.0  1.0
> > 19 0.23999 osd.19 up  1.0  1.0
> > -3 1.81998 host ndesan02
> >   3   0 osd.3down0  1.0
> >   4   0 osd.4down0  1.0
> >   5   0 osd.5down0  1.0
> >   9   0 osd.9down  1.0  1.0
> > 10   0 osd.10   down  1.0  1.0
> >   6 0.90999 osd.6  up  1.0  1.0
> >   7 0.90999 osd.7  up  1.0  1.0
> > -4 1.81998 host nde32
> > 20 0.90999 osd.20 up  1.0  1.0
> > 21 0.90999 osd.21 up  1.0  1.0
> > -5 4.54994 host ndesan03
> > 14 0.90999 osd.14 up  1.0  1.0
> > 15 0.90999 osd.15 up  1.0  1.0
> > 16 0.90999 osd.16 up  1.0  1.0
> > 17 0.90999 osd.17 up  1.0  1.0
> > 18 0.90999 osd.18 up  1.0  1.0
> >
> >
> > All OSDs marked as "down" are going to be removed. I looked for that
> > PG on all 3 nodes, and all of them have it. All services are up and
> > running, but for some reason this PG is not aware of that. Is there
> > any reasonable explanation and/or some advice how to get that PG
> > recovered?
> >
> > One thing I noticed:
> >
> > The data on the primary OSD (osd.16) had different timestamps than on
> > the other two OSDs:
> >
> > ---cut here---
> > ndesan03:~ # ls -rtl /var/lib/ceph/osd/ceph-16/current/1.3d3_head/
> > total 389436
> > -rw-r--r-- 1 root root   0 Jul 12  2016 __head_03D3__1
> > ...
> > -rw-r--r-- 1 root root   0 Jan  9 10:43
> > rbd\udata.bca465368d6b49.0a06__head_20EFF3D3__1
> > -rw-r--r-- 1 root root   0 Jan  9 10:43
> > rbd\udata.bca465368d6b49.0a8b__head_A014F3D3__1
> > -rw-r--r-- 1 root root   0 Jan  9 10:44
> > rbd\udata.bca465368d6b49.0e2c__head_00F2D3D3__1
> > -rw-r--r-- 1 root root   0 Jan  9 10:44
> > rbd\udata.bca465368d6b49.0e6a__head_C91813D3__1
> > -rw-r--r-- 1 root root 8388608 Jan 20 13:53
> > rbd\udata.cc94344e6afb66.08cb__head_6AA4B3D3__1
> > -rw-r--r-- 1 root root 8388608 Jan 20 14:47
> > rbd\udata.e15aee238e1f29.05f0__head_C95063D3__1
> > -rw-r--r-- 1 root root 8388608 Jan 20 15:10
> > rbd\udata.e15aee238e1f29.0d15__head_FF1083D3__1
> > -rw-r--r-- 1 root root 8388608 Jan 20 15:19
> > rbd\udata.e15aee238e1f29.100c__head_6B17F3D3__1
> > -rw-r--r-- 1 root root 8388608 Jan 23 14:17
> > rbd\udata.e73cf7b03e0c6.0479__head_C16003D3__1
> > -rw-r--r-- 1 root root 8388608 Jan 25 11:52
> > rbd\udata.d4edc95e884adc.00f4__head_00EE43D3__1
> > -rw-r--r-- 1 root root 4194304 Jan 27 08:07
> > rbd\udata.34595be2237e6.0ad5__head_D3CC93D3__1
> > -rw-r--r-- 1 root root 4194304 Jan 27 08:08
> > rbd\udata.34595be2237e6.0aff__head_3BF633D3__1
> > -rw-r--r-- 1

Re: [ceph-users] 1 PG stuck unclean (active+remapped) after OSD replacement

2017-02-13 Thread Wido den Hollander

> Op 13 februari 2017 om 16:03 schreef Eugen Block :
> 
> 
> Hi experts,
> 
> I have a strange situation right now. We are re-organizing our 4 node  
> Hammer cluster from LVM-based OSDs to HDDs. When we did this on the  
> first node last week, everything went smoothly, I removed the OSDs  
> from the crush map and the rebalancing and recovery finished  
> successfully.
> This weekend we did the same with the second node, we created the  
> HDD-based OSDs and added them to the cluster, waited for rebalancing  
> to finish and then stopped the old OSDs. Only this time the recovery  
> didn't completely finish, 4 PGs kept stuck unclean. I found out that 3  
> of these 4 PGs had their primary OSD on that node. So I restarted the  
> respective services and those 3 PGs recovered successfully. But there  
> is one last PG that gives me headaches.
> 
> ceph@ndesan01:~ # ceph pg map 1.3d3
> osdmap e24320 pg 1.3d3 (1.3d3) -> up [16,21] acting [16,21,0]
> 

What version of Ceph? And could it be that the cluster has old CRUSH tunables? 
When was it installed with which Ceph version?
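
A quick sketch for checking both (run from a node with an admin keyring; command names
as assumed here, adjust to your release):

# ceph --version
# ceph osd crush show-tunables
# ceph pg 1.3d3 query | less     # look at "up", "acting" and the peering state

If show-tunables still reports a very old profile, that could explain CRUSH failing to
map a third replica for this one PG.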

Wido

> ceph@ndesan01:~/ceph-deploy> ceph osd tree
> ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 9.38985 root default
> -2 1.19995 host ndesan01
>   0 0.23999 osd.0  up  1.0  1.0
>   1 0.23999 osd.1  up  1.0  1.0
>   2 0.23999 osd.2  up  1.0  1.0
> 13 0.23999 osd.13 up  1.0  1.0
> 19 0.23999 osd.19 up  1.0  1.0
> -3 1.81998 host ndesan02
>   3   0 osd.3down0  1.0
>   4   0 osd.4down0  1.0
>   5   0 osd.5down0  1.0
>   9   0 osd.9down  1.0  1.0
> 10   0 osd.10   down  1.0  1.0
>   6 0.90999 osd.6  up  1.0  1.0
>   7 0.90999 osd.7  up  1.0  1.0
> -4 1.81998 host nde32
> 20 0.90999 osd.20 up  1.0  1.0
> 21 0.90999 osd.21 up  1.0  1.0
> -5 4.54994 host ndesan03
> 14 0.90999 osd.14 up  1.0  1.0
> 15 0.90999 osd.15 up  1.0  1.0
> 16 0.90999 osd.16 up  1.0  1.0
> 17 0.90999 osd.17 up  1.0  1.0
> 18 0.90999 osd.18 up  1.0  1.0
> 
> 
> All OSDs marked as "down" are going to be removed. I looked for that  
> PG on all 3 nodes, and all of them have it. All services are up and  
> running, but for some reason this PG is not aware of that. Is there  
> any reasonable explanation and/or some advice how to get that PG  
> recovered?
> 
> One thing I noticed:
> 
> The data on the primary OSD (osd.16) had different timestamps than on  
> the other two OSDs:
> 
> ---cut here---
> ndesan03:~ # ls -rtl /var/lib/ceph/osd/ceph-16/current/1.3d3_head/
> total 389436
> -rw-r--r-- 1 root root   0 Jul 12  2016 __head_03D3__1
> ...
> -rw-r--r-- 1 root root   0 Jan  9 10:43  
> rbd\udata.bca465368d6b49.0a06__head_20EFF3D3__1
> -rw-r--r-- 1 root root   0 Jan  9 10:43  
> rbd\udata.bca465368d6b49.0a8b__head_A014F3D3__1
> -rw-r--r-- 1 root root   0 Jan  9 10:44  
> rbd\udata.bca465368d6b49.0e2c__head_00F2D3D3__1
> -rw-r--r-- 1 root root   0 Jan  9 10:44  
> rbd\udata.bca465368d6b49.0e6a__head_C91813D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 13:53  
> rbd\udata.cc94344e6afb66.08cb__head_6AA4B3D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 14:47  
> rbd\udata.e15aee238e1f29.05f0__head_C95063D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 15:10  
> rbd\udata.e15aee238e1f29.0d15__head_FF1083D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 15:19  
> rbd\udata.e15aee238e1f29.100c__head_6B17F3D3__1
> -rw-r--r-- 1 root root 8388608 Jan 23 14:17  
> rbd\udata.e73cf7b03e0c6.0479__head_C16003D3__1
> -rw-r--r-- 1 root root 8388608 Jan 25 11:52  
> rbd\udata.d4edc95e884adc.00f4__head_00EE43D3__1
> -rw-r--r-- 1 root root 4194304 Jan 27 08:07  
> rbd\udata.34595be2237e6.0ad5__head_D3CC93D3__1
> -rw-r--r-- 1 root root 4194304 Jan 27 08:08  
> rbd\udata.34595be2237e6.0aff__head_3BF633D3__1
> -rw-r--r-- 1 root root 4194304 Jan 27 16:20  
> rbd\udata.8b61c69f34baf.876a__head_A60A63D3__1
> -rw-r--r-- 1 root root 4194304 Jan 29 17:45  
> rbd\udata.28fcaf199543c3.0ae7__head_C1BA53D3__1
> -rw-r--r-- 1 root root 4194304 Jan 30 06:33  
> rbd\udata.28fcaf199543c3.1832__head_6EC113D3__1
> -rw-r--r-- 1 root root 4194304 Jan 31 10:33  
> rb.0.ddcdf5.238e1f29.00e4__head_3F1543D3__1
> -rw-r--r-- 1 root root 4194304 Feb 13 06:14 

Re: [ceph-users] SMR disks go 100% busy after ~15 minutes

2017-02-13 Thread Wido den Hollander

> Op 13 februari 2017 om 15:57 schreef Peter Maloney 
> :
> 
> 
> Then you're not aware of what the SMR disks do. They are just slow for
> all writes, having to read the tracks around, then write it all again
> instead of just the one thing you really wanted to write, due to
> overlap. Then to partially mitigate this, they have some tiny write
> buffer like 8GB flash, and then they use that for the "normal" speed,
> and then when it's full, you crawl (at least this is what the seagate
> ones do). Journals aren't designed to solve that... they help prevent
> the sync load on the osd, but don't somehow make the throughput higher
> (at least not sustained). Even if the journal was perfectly designed for
> performance, it would still do absolutely nothing if it's full and the
> disk is still busy with the old flushing.
> 

Well, that indeed explains it. I wasn't aware of the additional buffer inside an 
SMR disk.

I was asked to look at this system for somebody who bought SMR disks without 
knowing it. As I never touch these disks, I found the behavior odd.

The buffer explains the behavior a lot better; I wasn't aware that SMR disks have one.

SMR disks shouldn't be used in Ceph without proper SMR support in BlueStore or an 
SMR-aware XFS.

Wido

> 
> On 02/13/17 15:49, Wido den Hollander wrote:
> > Hi,
> >
> > I have a odd case with SMR disks in a Ceph cluster. Before I continue, yes, 
> > I am fully aware of SMR and Ceph not playing along well, but there is 
> > something happening which I'm not able to fully explain.
> >
> > On a 2x replica cluster with 8TB Seagate SMR disks I can write with about 
> > 30MB/sec to each disk using a simple RADOS bench:
> >
> > $ rados bench -t 1
> > $ time rados put 1GB.bin
> >
> > Both ways I found out that the disk can write at that rate.
> >
> > Now, when I start a benchmark with 32 threads it writes fine. Not super 
> > fast, but it works.
> >
> > After 15 minutes or so various disks go to 100% busy and just stay there. 
> > These OSDs are being marked as down and some even commit suicide due to 
> > threads timing out.
> >
> > Stopping the RADOS bench and starting the OSDs again resolves the situation.
> >
> > I am trying to explain what's happening. I'm aware that SMR isn't very good 
> > at Random Writes. To partially overcome this there are Intel DC 3510s in 
> > there as Journal SSDs.
> >
> > Can anybody explain why this 100% busy pops up after 15 minutes or so?
> >
> > Obviously it would the best if BlueStore had SMR support, but for now it's 
> > just Filestore with XFS on there.
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 1 PG stuck unclean (active+remapped) after OSD replacement

2017-02-13 Thread Eugen Block

Hi experts,

I have a strange situation right now. We are re-organizing our 4 node  
Hammer cluster from LVM-based OSDs to HDDs. When we did this on the  
first node last week, everything went smoothly, I removed the OSDs  
from the crush map and the rebalancing and recovery finished  
successfully.
This weekend we did the same with the second node, we created the  
HDD-based OSDs and added them to the cluster, waited for rebalancing  
to finish and then stopped the old OSDs. Only this time the recovery  
didn't completely finish; 4 PGs stayed stuck unclean. I found out that 3  
of these 4 PGs had their primary OSD on that node. So I restarted the  
respective services and those 3 PGs recovered successfully. But there  
is one last PG that gives me headaches.


ceph@ndesan01:~ # ceph pg map 1.3d3
osdmap e24320 pg 1.3d3 (1.3d3) -> up [16,21] acting [16,21,0]

ceph@ndesan01:~/ceph-deploy> ceph osd tree
ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 9.38985 root default
-2 1.19995 host ndesan01
 0 0.23999 osd.0  up  1.0  1.0
 1 0.23999 osd.1  up  1.0  1.0
 2 0.23999 osd.2  up  1.0  1.0
13 0.23999 osd.13 up  1.0  1.0
19 0.23999 osd.19 up  1.0  1.0
-3 1.81998 host ndesan02
 3   0 osd.3down0  1.0
 4   0 osd.4down0  1.0
 5   0 osd.5down0  1.0
 9   0 osd.9down  1.0  1.0
10   0 osd.10   down  1.0  1.0
 6 0.90999 osd.6  up  1.0  1.0
 7 0.90999 osd.7  up  1.0  1.0
-4 1.81998 host nde32
20 0.90999 osd.20 up  1.0  1.0
21 0.90999 osd.21 up  1.0  1.0
-5 4.54994 host ndesan03
14 0.90999 osd.14 up  1.0  1.0
15 0.90999 osd.15 up  1.0  1.0
16 0.90999 osd.16 up  1.0  1.0
17 0.90999 osd.17 up  1.0  1.0
18 0.90999 osd.18 up  1.0  1.0


All OSDs marked as "down" are going to be removed. I looked for that  
PG on all 3 nodes, and all of them have it. All services are up and  
running, but for some reason this PG is not aware of that. Is there  
any reasonable explanation and/or some advice how to get that PG  
recovered?


One thing I noticed:

The data on the primary OSD (osd.16) had different timestamps than on  
the other two OSDs:


---cut here---
ndesan03:~ # ls -rtl /var/lib/ceph/osd/ceph-16/current/1.3d3_head/
total 389436
-rw-r--r-- 1 root root   0 Jul 12  2016 __head_03D3__1
...
-rw-r--r-- 1 root root   0 Jan  9 10:43  
rbd\udata.bca465368d6b49.0a06__head_20EFF3D3__1
-rw-r--r-- 1 root root   0 Jan  9 10:43  
rbd\udata.bca465368d6b49.0a8b__head_A014F3D3__1
-rw-r--r-- 1 root root   0 Jan  9 10:44  
rbd\udata.bca465368d6b49.0e2c__head_00F2D3D3__1
-rw-r--r-- 1 root root   0 Jan  9 10:44  
rbd\udata.bca465368d6b49.0e6a__head_C91813D3__1
-rw-r--r-- 1 root root 8388608 Jan 20 13:53  
rbd\udata.cc94344e6afb66.08cb__head_6AA4B3D3__1
-rw-r--r-- 1 root root 8388608 Jan 20 14:47  
rbd\udata.e15aee238e1f29.05f0__head_C95063D3__1
-rw-r--r-- 1 root root 8388608 Jan 20 15:10  
rbd\udata.e15aee238e1f29.0d15__head_FF1083D3__1
-rw-r--r-- 1 root root 8388608 Jan 20 15:19  
rbd\udata.e15aee238e1f29.100c__head_6B17F3D3__1
-rw-r--r-- 1 root root 8388608 Jan 23 14:17  
rbd\udata.e73cf7b03e0c6.0479__head_C16003D3__1
-rw-r--r-- 1 root root 8388608 Jan 25 11:52  
rbd\udata.d4edc95e884adc.00f4__head_00EE43D3__1
-rw-r--r-- 1 root root 4194304 Jan 27 08:07  
rbd\udata.34595be2237e6.0ad5__head_D3CC93D3__1
-rw-r--r-- 1 root root 4194304 Jan 27 08:08  
rbd\udata.34595be2237e6.0aff__head_3BF633D3__1
-rw-r--r-- 1 root root 4194304 Jan 27 16:20  
rbd\udata.8b61c69f34baf.876a__head_A60A63D3__1
-rw-r--r-- 1 root root 4194304 Jan 29 17:45  
rbd\udata.28fcaf199543c3.0ae7__head_C1BA53D3__1
-rw-r--r-- 1 root root 4194304 Jan 30 06:33  
rbd\udata.28fcaf199543c3.1832__head_6EC113D3__1
-rw-r--r-- 1 root root 4194304 Jan 31 10:33  
rb.0.ddcdf5.238e1f29.00e4__head_3F1543D3__1
-rw-r--r-- 1 root root 4194304 Feb 13 06:14  
rbd\udata.856071751c29d.617b__head_E1E4A3D3__1

---cut here---

The other two OSDs have identical timestamps, I just post the  
(shortened) output of osd.21:


---cut here---
nde32:/var/lib/ceph/osd/ceph-21/current # ls -lrt  
/var/lib/ceph/osd/ceph-21/current/1.3d3_head/

total 389432
-rw-r--r-- 1 root root   0 Feb  6 15:29 __head_03D3__1
...
-rw-r--r-- 1 root root

Re: [ceph-users] SMR disks go 100% busy after ~15 minutes

2017-02-13 Thread Peter Maloney
Then you're not aware of what SMR disks do. They are just slow for
all writes: because of the overlapping tracks they have to read the
surrounding tracks and write them all back again, instead of just the
one thing you actually wanted to write. To partially mitigate this they
have a small write buffer, something like 8GB of flash, which gives them
the "normal" speed, and once it's full you crawl (at least this is what
the Seagate ones do). Journals aren't designed to solve that... they help
absorb the sync load on the OSD, but they don't somehow make the sustained
throughput higher. Even a perfectly designed journal would still do
absolutely nothing once it's full and the disk is still busy flushing
the old data.
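
One way to watch this happen is to keep iostat running on the OSD disks during the
bench; once the drive's persistent cache is exhausted you'll see it pinned at ~100%
utilization with very low write throughput (a sketch, the device names are examples):

# iostat -x 5 sdb sdc sdd     # watch %util, w/s and wMB/s per OSD disk

The "fine for ~15 minutes, then 100% busy" pattern matches the point where that
cache runs out and the drive starts rewriting whole shingled zones.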


On 02/13/17 15:49, Wido den Hollander wrote:
> Hi,
>
> I have a odd case with SMR disks in a Ceph cluster. Before I continue, yes, I 
> am fully aware of SMR and Ceph not playing along well, but there is something 
> happening which I'm not able to fully explain.
>
> On a 2x replica cluster with 8TB Seagate SMR disks I can write with about 
> 30MB/sec to each disk using a simple RADOS bench:
>
> $ rados bench -t 1
> $ time rados put 1GB.bin
>
> Both ways I found out that the disk can write at that rate.
>
> Now, when I start a benchmark with 32 threads it writes fine. Not super fast, 
> but it works.
>
> After 15 minutes or so various disks go to 100% busy and just stay there. 
> These OSDs are being marked as down and some even commit suicide due to 
> threads timing out.
>
> Stopping the RADOS bench and starting the OSDs again resolves the situation.
>
> I am trying to explain what's happening. I'm aware that SMR isn't very good 
> at Random Writes. To partially overcome this there are Intel DC 3510s in 
> there as Journal SSDs.
>
> Can anybody explain why this 100% busy pops up after 15 minutes or so?
>
> Obviously it would the best if BlueStore had SMR support, but for now it's 
> just Filestore with XFS on there.
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SMR disks go 100% busy after ~15 minutes

2017-02-13 Thread Wido den Hollander
Hi,

I have an odd case with SMR disks in a Ceph cluster. Before I continue, yes, I 
am fully aware of SMR and Ceph not playing along well, but there is something 
happening which I'm not able to fully explain.

On a 2x replica cluster with 8TB Seagate SMR disks I can write with about 
30MB/sec to each disk using a simple RADOS bench:

$ rados bench -t 1
$ time rados put 1GB.bin

Both ways I found out that the disk can write at that rate.

Now, when I start a benchmark with 32 threads it writes fine. Not super fast, 
but it works.

After 15 minutes or so various disks go to 100% busy and just stay there. These 
OSDs are being marked as down and some even commit suicide due to threads 
timing out.

Stopping the RADOS bench and starting the OSDs again resolves the situation.

I am trying to explain what's happening. I'm aware that SMR isn't very good at 
Random Writes. To partially overcome this there are Intel DC 3510s in there as 
Journal SSDs.

Can anybody explain why this 100% busy pops up after 15 minutes or so?

Obviously it would be best if BlueStore had SMR support, but for now it's just 
Filestore with XFS on there.

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-13 Thread Donny Davis
I am having the same issue. When I looked at my idle cluster this morning,
one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of that.
I have 3 AIO nodes, and only one of them seemed to be affected.

On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard  wrote:

> Want to install debuginfo packages and use something like this to try
> and find out where it is spending most of its time?
>
> https://poormansprofiler.org/
>
> Note that you may need to do multiple runs to get a "feel" for where
> it is spending most of its time. Also note that likely only one or two
> threads will be using the CPU (you can see this in ps output using a
> command like the following) the rest will likely be idle or waiting
> for something.
>
> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>
> Observation of these two and maybe a couple of manual gstack dumps
> like this to compare thread ids to ps output (LWP is the thread id
> (tid) in gdb output) should give us some idea of where it is spinning.
>
> # gstack $(pidof ceph-mgr)
>
>
> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>  wrote:
> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS
> 7 w/
> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has
> > allocated ~11GB of RAM after a single day of usage. Only the active
> manager
> > is performing this way. The growth is linear and reproducible.
> >
> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB
> OSDs
> > each.
> >
> >
> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
> >
> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
> >
> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,
> 0.0
> > st
> >
> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
> buff/cache
> >
> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772 avail
> Mem
> >
> >
> >   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
> COMMAND
> >
> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27
> ceph-mgr
> >
> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50
> ceph-mon
> >
> >
> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
> >  wrote:
> >>
> >> John,
> >>
> >> This morning I compared the logs from yesterday and I show a noticeable
> >> increase in messages like these:
> >>
> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all mon_status
> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all health
> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all pg_summary
> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
> >> mgrdigest v1
> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all mon_status
> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all health
> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all pg_summary
> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
> >> mgrdigest v1
> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
> >>
> >>
> >> In a 1 minute period yesterday I saw 84 times this group of messages
> >> showed up.  Today that same group of messages showed up 156 times.
> >>
> >> Other than that I did see an increase in this messages from 9 times a
> >> minute to 14 times a minute:
> >>
> >> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104
> >> -
> >> conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> >> l=0).fault with nothing to send and in the half  accept state just
> closed
> >>
> >> Let me know if you need anything else.
> >>
> >> Bryan
> >>
> >>
> >> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
> >>  >> bryan.stillw...@charter.com> wrote:
> >>
> >> >On 1/10/17, 5:35 AM, "John Spray"  wrote:
> >> >
> >> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
> >> >> wrote:
> >> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
> >> >>> single node, two OSD cluster, and after a while I noticed that the
> new
> >> >>> ceph-mgr daemon is frequently using a lot of the CPU:
> >> >>>
> >> >>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
> >> >>> ceph-mgr
> >> >>>
> >> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its
> CPU
> >> >>> usage down to < 1%, but after a while it climbs back up to > 100%.
> >> >>> Has
> >> >>> any

Re: [ceph-users] OSDs cannot match up with fast OSD map changes (epochs) during recovery

2017-02-13 Thread Wido den Hollander

> Op 13 februari 2017 om 12:57 schreef Muthusamy Muthiah 
> :
> 
> 
> Hi All,
> 
> We also have same issue on one of our platforms which was upgraded from
> 11.0.2 to 11.2.0 . The issue occurs on one node alone where CPU hits 100%
> and OSDs of that node marked down. Issue not seen on cluster which was
> installed from scratch with 11.2.0.
> 

How many maps is this OSD behind?

Does it help if you set the nodown flag for a moment to let it catch up?
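
A minimal sketch for checking how far behind it is and letting it catch up, using
osd.315 from the log above as the example (ceph daemon has to be run on the node
hosting the OSD, against its admin socket):

# ceph daemon osd.315 status    # compare oldest_map/newest_map with the cluster epoch
# ceph osd stat                 # shows the current osdmap epoch
# ceph osd set nodown           # optional: stop the flapping while it catches up
# ceph osd unset nodown         # once the OSD is current again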

Wido

> 
> *[r...@cn3.c7.vna ~] # systemctl start ceph-osd@315.service
>  [r...@cn3.c7.vna ~] # cd /var/log/ceph/
> [r...@cn3.c7.vna ceph] # tail -f *osd*315.log 2017-02-13 11:29:46.752897
> 7f995c79b940  0 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/cls/hello/cls_hello.cc:296:
> loading cls_hello 2017-02-13 11:29:46.753065 7f995c79b940  0 _get_class not
> permitted to load kvs 2017-02-13 11:29:46.757571 7f995c79b940  0 _get_class
> not permitted to load lua 2017-02-13 11:29:47.058720 7f995c79b940  0
> osd.315 44703 crush map has features 288514119978713088, adjusting msgr
> requires for clients 2017-02-13 11:29:47.058728 7f995c79b940  0 osd.315
> 44703 crush map has features 288514394856620032 was 8705, adjusting msgr
> requires for mons 2017-02-13 11:29:47.058732 7f995c79b940  0 osd.315 44703
> crush map has features 288531987042664448, adjusting msgr requires for osds
> 2017-02-13 11:29:48.343979 7f995c79b940  0 osd.315 44703 load_pgs
> 2017-02-13 11:29:55.913550 7f995c79b940  0 osd.315 44703 load_pgs opened
> 130 pgs 2017-02-13 11:29:55.913604 7f995c79b940  0 osd.315 44703 using 1 op
> queue with priority op cut off at 64. 2017-02-13 11:29:55.914102
> 7f995c79b940 -1 osd.315 44703 log_to_monitors {default=true} 2017-02-13
> 11:30:19.384897 7f9939bbb700  1 heartbeat_map reset_timeout 'tp_osd thread
> tp_osd' had timed out after 15 2017-02-13 11:30:31.073336 7f9955a2b700  1
> heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> 2017-02-13 11:30:31.073343 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd
> thread tp_osd' had timed out after 15 2017-02-13 11:30:31.073344
> 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed
> out after 15 2017-02-13 11:30:31.073345 7f9955a2b700  1 heartbeat_map
> is_healthy 'tp_osd thread tp_osd' had timed out after 15 2017-02-13
> 11:30:31.073347 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread
> tp_osd' had timed out after 15 2017-02-13 11:30:31.073348 7f9955a2b700  1
> heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after
> 152017-02-13 11:30:54.772516 7f995c79b940  0 osd.315 44703 done with init,
> starting boot process*
> 
> 
> *Thanks,*
> *Muthu*
> 
> On 13 February 2017 at 10:50, Andreas Gerstmayr  > wrote:
> 
> > Hi,
> >
> > Due to a faulty upgrade from Jewel 10.2.0 to Kraken 11.2.0 our test
> > cluster is unhealthy since about two weeks and can't recover itself
> > anymore (unfortunately I skipped the upgrade to 10.2.5 because I
> > missed the ".z" in "All clusters must first be upgraded to Jewel
> > 10.2.z").
> >
> > Immediately after the upgrade I saw the following in the OSD logs:
> > s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing
> > to send and in the half  accept state just closed
> >
> > There are also missed heartbeats in the OSD logs, and the OSDs which
> > don't send heartbeats have the following in their logs:
> > 2017-02-08 19:44:51.367828 7f9be8c37700  1 heartbeat_map is_healthy
> > 'tp_osd thread tp_osd' had timed out after 15
> > 2017-02-08 19:44:54.271010 7f9bc4e96700  1 heartbeat_map reset_timeout
> > 'tp_osd thread tp_osd' had timed out after 15
> >
> > During investigating we found out that some OSDs were lagging about
> > 100-2 OSD map epochs behind. The monitor publishes new epochs
> > every few seconds, but the OSD daemons are pretty slow in applying
> > them (up to a few minutes for 100 epochs). During recovery of the 24
> > OSDs of a storage node the CPU is running at almost 100% (the nodes
> > have 16 real cores, or 32 with Hyper-Threading).
> >
> > We had at times servers where all 24 OSDs were up-to-date with the
> > latest OSD map, but somehow they lost it and were lagging behind
> > again. During recovery some OSDs used up to 25 GB of RAM, which led to
> > out of memory and further lagging of the OSDs of the affected server.
> >
> > We already set the nodown, noout, norebalance, nobackfill, norecover,
> > noscrub and nodeep-scrub flags to prevent OSD flapping and even more
> > new OSD epochs.
> >
> > Is there anything we can do to let the OSDs recover? It seems that the
> > servers don't have enough CPU resources for recovery. I already played
> > around with the osd map message max setting (when I increased it to
> > 1000 to speed up recovery, the OSDs didn't get any updates at all?),
> > and the osd heartbe

[ceph-users] Re: Re: mon is stuck in leveldb and costs nearly 100% cpu

2017-02-13 Thread Chenyehua
Thanks for the response, Shinobu
The warning disappears due to your suggesting solution, however the nearly 100% 
cpu cost still exists and concerns me a lot.
So, do you know why the cpu cost is so high?
Are there any solutions or suggestions to this problem?

Cheers

-----Original Message-----
From: Shinobu Kinjo [mailto:ski...@redhat.com] 
Sent: 13 February 2017 10:54
To: chenyehua 11692 (RD)
Cc: kc...@redhat.com; ceph-users@lists.ceph.com
Subject: Re: Re: [ceph-users] mon is stuck in leveldb and costs nearly 100% cpu

OK, that's a reasonable answer. Would you run the following on all hosts which the 
MONs are running on:

 #* ceph --admin-daemon /var/run/ceph/ceph-mon.`hostname -s`.asok config show | 
grep leveldb_log

Anyway, you can compact the leveldb store at runtime:

 #* ceph tell mon.`hostname -s` compact

And you should set this in ceph.conf to prevent the same issue the next time:

 #* [mon]
 #* mon compact on start = true
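
To see where that CPU time is actually going and how big the store really is, a quick
sketch (paths assume the default mon data directory):

 #* du -sh /var/lib/ceph/mon/ceph-`hostname -s`/store.db
 #* perf top -p `pidof ceph-mon`      # needs the perf package
 #* ceph daemon mon.`hostname -s` perf dump | less

If the leveldb compaction threads dominate in perf top, the compaction settings above
are the right knobs.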


On Mon, Feb 13, 2017 at 11:37 AM, Chenyehua  wrote:
> Sorry, I made a mistake, the ceph version is actually 0.94.5
>
> -----Original Message-----
> From: chenyehua 11692 (RD)
> Sent: 13 February 2017 9:40
> To: 'Shinobu Kinjo'
> Cc: kc...@redhat.com; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mon is stuck in leveldb and costs nearly 100% cpu
>
> My ceph version is 10.2.5
>
> -----Original Message-----
> From: Shinobu Kinjo [mailto:ski...@redhat.com]
> Sent: 12 February 2017 13:12
> To: chenyehua 11692 (RD)
> Cc: kc...@redhat.com; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mon is stuck in leveldb and costs nearly 100% cpu
>
> Which Ceph version are you using?
>
> On Sat, Feb 11, 2017 at 5:02 PM, Chenyehua  wrote:
>> Dear Mr Kefu Chai
>>
>> Sorry to disturb you.
>>
>> I meet a problem recently. In my ceph cluster ,health status has 
>> warning “store is getting too big!” for several days; and  ceph-mon 
>> costs nearly 100% cpu;
>>
>> Have you ever met this situation?
>>
>> Some detailed information are attached below:
>>
>>
>>
>> root@cvknode17:~# ceph -s
>>
>> cluster 04afba60-3a77-496c-b616-2ecb5e47e141
>>
>>  health HEALTH_WARN
>>
>> mon.cvknode17 store is getting too big! 34104 MB >= 15360 
>> MB
>>
>>  monmap e1: 3 mons at
>> {cvknode15=172.16.51.15:6789/0,cvknode16=172.16.51.16:6789/0,cvknode1
>> 7
>> =172.16.51.17:6789/0}
>>
>> election epoch 862, quorum 0,1,2
>> cvknode15,cvknode16,cvknode17
>>
>>  osdmap e196279: 347 osds: 347 up, 347 in
>>
>>   pgmap v5891025: 33272 pgs, 16 pools, 26944 GB data, 6822 
>> kobjects
>>
>> 65966 GB used, 579 TB / 644 TB avail
>>
>>33270 active+clean
>>
>>2 active+clean+scrubbing+deep
>>
>>   client io 840 kB/s rd, 739 kB/s wr, 35 op/s rd, 184 op/s wr
>>
>>
>>
>> root@cvknode17:~# top
>>
>> top - 15:19:28 up 23 days, 23:58,  6 users,  load average: 1.08, 
>> 1.40,
>> 1.77
>>
>> Tasks: 346 total,   2 running, 342 sleeping,   0 stopped,   2 zombie
>>
>> Cpu(s):  8.1%us, 10.8%sy,  0.0%ni, 69.0%id,  9.5%wa,  0.0%hi,  
>> 2.5%si, 0.0%st
>>
>> Mem:  65384424k total, 58102880k used,  7281544k free,   240720k buffers
>>
>> Swap: 2100k total,   344944k used, 29654156k free, 24274272k cached
>>
>>
>>
>> PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>>
>>   24407 root  20   0 17.3g  12g  10m S   98 20.2   8420:11 ceph-mon
>>
>>
>>
>> root@cvknode17:~# top -Hp 24407
>>
>> top - 15:19:49 up 23 days, 23:59,  6 users,  load average: 1.12, 
>> 1.39,
>> 1.76
>>
>> Tasks:  17 total,   1 running,  16 sleeping,   0 stopped,   0 zombie
>>
>> Cpu(s):  8.1%us, 10.8%sy,  0.0%ni, 69.0%id,  9.5%wa,  0.0%hi,  
>> 2.5%si, 0.0%st
>>
>> Mem:  65384424k total, 58104868k used,  7279556k free,   240744k buffers
>>
>> Swap: 2100k total,   344944k used, 29654156k free, 24271188k cached
>>
>>
>>
>> PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>>
>>   25931 root  20   0 17.3g  12g   9m R   98 20.2   7957:37 ceph-mon
>>
>>   24514 root  20   0 17.3g  12g   9m S2 20.2   3:06.75 ceph-mon
>>
>>   25932 root  20   0 17.3g  12g   9m S2 20.2   1:07.82 ceph-mon
>>
>>   24407 root  20   0 17.3g  12g   9m S0 20.2   0:00.67 ceph-mon
>>
>>   24508 root  20   0 17.3g  12g   9m S0 20.2  15:50.24 ceph-mon
>>
>>   24513 root  20   0 17.3g  12g   9m S0 20.2   0:07.88 ceph-mon
>>
>>   24534 root  20   0 17.3g  12g   9m S0 20.2 196:33.85 ceph-mon
>>
>>   24535 root  20   0 17.3g  12g   9m S0 20.2   0:00.01 ceph-mon
>>
>>   25929 root  20   0 17.3g  12g   9m S0 20.2   3:06.09 ceph-mon
>>
>>   25930 root  20   0 17.3g  12g   9m S0 20.2   8:12.58 ceph-mon
>>
>>   25933 root  20   0 17.3g  12g   9m S0 20.2   4:42.22 ceph-mon
>>
>>   25934 root  20   0 17.3g  12g   9m S0 20.2  40:53.27 ceph-mon
>>
>>   25935 root  20   0 17.3g  12g   9m S0 20.2   0:04.84 ceph-mon
>>
>>   25936 root  20   0 17.3g  12g   9m S0 20.2   0:00.01 ceph-mon
>>
>>   25980 root  20   0 17.3g  12g   9m S0 20.2   0:06.65 ceph-mon
>>

Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread koukou73gr
On 2017-02-13 13:47, Wido den Hollander wrote:

> 
> The udev rules of Ceph should chown the journal to ceph:ceph if it's set to 
> the right partition UUID.
> 
> This blog shows it partially: 
> http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/
> 
> This is done by *95-ceph-osd.rules*, you might want to check the source of 
> that.
> 

Unfortunately the udev rules do not handle non-GPT partitioned disks.
This is because the partition typecode GUID is simply not supported on MBR
partitions.

If your journals live on an MBR disk you'll have to add some custom udev
rules yourself. This is what I did:

[root@ceph-10-206-123-182 ~]# cat
/etc/udev/rules.d/70-persisnent-ceph-journal.rules
KERNEL=="sdc5", SUBSYSTEM=="block", ATTRS{model}=="KINGSTON SV300S3",
OWNER="ceph", GROUP="ceph"
KERNEL=="sdc6", SUBSYSTEM=="block", ATTRS{model}=="KINGSTON SV300S3",
OWNER="ceph", GROUP="ceph"

Cheers,

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs cannot match up with fast OSD map changes (epochs) during recovery

2017-02-13 Thread Muthusamy Muthiah
Hi All,

We also have the same issue on one of our platforms, which was upgraded from
11.0.2 to 11.2.0. The issue occurs on one node alone, where the CPU hits 100%
and the OSDs of that node are marked down. The issue is not seen on a cluster
that was installed from scratch with 11.2.0.

[r...@cn3.c7.vna ~] # systemctl start ceph-osd@315.service
[r...@cn3.c7.vna ~] # cd /var/log/ceph/
[r...@cn3.c7.vna ceph] # tail -f *osd*315.log
2017-02-13 11:29:46.752897 7f995c79b940  0 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/cls/hello/cls_hello.cc:296: loading cls_hello
2017-02-13 11:29:46.753065 7f995c79b940  0 _get_class not permitted to load kvs
2017-02-13 11:29:46.757571 7f995c79b940  0 _get_class not permitted to load lua
2017-02-13 11:29:47.058720 7f995c79b940  0 osd.315 44703 crush map has features 288514119978713088, adjusting msgr requires for clients
2017-02-13 11:29:47.058728 7f995c79b940  0 osd.315 44703 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
2017-02-13 11:29:47.058732 7f995c79b940  0 osd.315 44703 crush map has features 288531987042664448, adjusting msgr requires for osds
2017-02-13 11:29:48.343979 7f995c79b940  0 osd.315 44703 load_pgs
2017-02-13 11:29:55.913550 7f995c79b940  0 osd.315 44703 load_pgs opened 130 pgs
2017-02-13 11:29:55.913604 7f995c79b940  0 osd.315 44703 using 1 op queue with priority op cut off at 64.
2017-02-13 11:29:55.914102 7f995c79b940 -1 osd.315 44703 log_to_monitors {default=true}
2017-02-13 11:30:19.384897 7f9939bbb700  1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073336 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073343 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073344 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073345 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073347 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:31.073348 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
2017-02-13 11:30:54.772516 7f995c79b940  0 osd.315 44703 done with init, starting boot process


Thanks,
Muthu

On 13 February 2017 at 10:50, Andreas Gerstmayr  wrote:

> Hi,
>
> Due to a faulty upgrade from Jewel 10.2.0 to Kraken 11.2.0 our test
> cluster is unhealthy since about two weeks and can't recover itself
> anymore (unfortunately I skipped the upgrade to 10.2.5 because I
> missed the ".z" in "All clusters must first be upgraded to Jewel
> 10.2.z").
>
> Immediately after the upgrade I saw the following in the OSD logs:
> s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing
> to send and in the half  accept state just closed
>
> There are also missed heartbeats in the OSD logs, and the OSDs which
> don't send heartbeats have the following in their logs:
> 2017-02-08 19:44:51.367828 7f9be8c37700  1 heartbeat_map is_healthy
> 'tp_osd thread tp_osd' had timed out after 15
> 2017-02-08 19:44:54.271010 7f9bc4e96700  1 heartbeat_map reset_timeout
> 'tp_osd thread tp_osd' had timed out after 15
>
> During investigating we found out that some OSDs were lagging about
> 100-2 OSD map epochs behind. The monitor publishes new epochs
> every few seconds, but the OSD daemons are pretty slow in applying
> them (up to a few minutes for 100 epochs). During recovery of the 24
> OSDs of a storage node the CPU is running at almost 100% (the nodes
> have 16 real cores, or 32 with Hyper-Threading).
>
> We had at times servers where all 24 OSDs were up-to-date with the
> latest OSD map, but somehow they lost it and were lagging behind
> again. During recovery some OSDs used up to 25 GB of RAM, which led to
> out of memory and further lagging of the OSDs of the affected server.
>
> We already set the nodown, noout, norebalance, nobackfill, norecover,
> noscrub and nodeep-scrub flags to prevent OSD flapping and even more
> new OSD epochs.
>
> Is there anything we can do to let the OSDs recover? It seems that the
> servers don't have enough CPU resources for recovery. I already played
> around with the osd map message max setting (when I increased it to
> 1000 to speed up recovery, the OSDs didn't get any updates at all?),
> and the osd heartbeat grace and osd thread timeout settings (to give
> the overloaded server more time), but without success so far. I've
> seen errors related to the AsyncMessenger in the logs, so I reverted
> back to the SimpleMessenger (which was working successfully with
> Jewel).
>
>
> Cluster details:
> 6 storage nodes with 2x Intel Xeon E5-2630 v3 8x2.40GHz
> 256GB RAM
> Each storage node has 24 HDDs attac

Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread Wido den Hollander

> Op 13 februari 2017 om 12:06 schreef Piotr Dzionek :
> 
> 
> Hi,
> 
> I am running ceph Jewel 10.2.5 with separate journals - ssd disks. It 
> runs pretty smooth, however I stumble upon an issue after system reboot. 
> Journal disks become owned by root and ceph failed to start.
> 
> starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 
> /var/lib/ceph/osd/ceph-4/journal
> 2017-02-10 16:24:29.924126 7fd07ab40800 -1 
> filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal 
> /var/lib/ceph/osd/ceph-4/journal: (13) Permission denied
> 2017-02-10 16:24:29.924210 7fd07ab40800 -1 osd.4 0 OSD:init: unable to 
> mount object store
> 2017-02-10 16:24:29.924217 7fd07ab40800 -1 #033[0;31m ** ERROR: osd 
> init failed: (13) Permission denied#033[0m
> 
> I fixed this issue by finding journal disks in /dev dir and chown to 
> ceph:ceph. I remember that I had a similar issue after I installed it 
> for a first time. Is it a bug ? or do I have to set some kind of udev 
> rules for this disks?
> 
> FYI, I have this issue after every restart now.
> 

The udev rules of Ceph should chown the journal to ceph:ceph if it's set to the 
right partition UUID.

This blog shows it partially: 
http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/

This is done by *95-ceph-osd.rules*, you might want to check the source of that.
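
For reference, on an RPM-based install that rules file normally lives at
/lib/udev/rules.d/95-ceph-osd.rules; a quick sketch (assuming that path) to confirm
it is present and see which partition type GUIDs it matches on:

# rpm -qf /lib/udev/rules.d/95-ceph-osd.rules
# grep -A4 JOURNAL_UUID /lib/udev/rules.d/95-ceph-osd.rules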

Wido

> Kind regards,
> Piotr Dzionek
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] - permission denied on journal after reboot

2017-02-13 Thread Craig Chi
Hi,

What is your OS? The permission of journal partition should be changed by udev 
rules: /lib/udev/rules.d/95-ceph-osd.rules
In this file, it is described as:
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
OWNER:="ceph", GROUP:="ceph", MODE:="660", \
RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"

You can also use the udevadm command to test whether the partition has been 
processed by the correct udev rule, like the following:

#>udevadm test /sys/block/sdb/sdb2

...
starting 'probe-bcache -o udev /dev/sdb2'
Process 'probe-bcache -o udev /dev/sdb2' succeeded.
OWNER 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
GROUP 64045 /lib/udev/rules.d/95-ceph-osd.rules:16
MODE 0660 /lib/udev/rules.d/95-ceph-osd.rules:16
RUN '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name' 
/lib/udev/rules.d/95-ceph-osd.rules:16
...

Then /dev/sdb2 will have ceph:ceph permission automatically.

#>ls -l /dev/sdb2
brw-rw 1 ceph ceph 8, 18 Feb 13 19:43 /dev/sdb2

Sincerely,
Craig Chi

On 2017-02-13 19:06, Piotr Dzionekwrote:
> 
> Hi,
> 
> 
> I am running ceph Jewel 10.2.5 with separate journals - ssd disks. It runs 
> pretty smooth, however I stumble upon an issue after system reboot. Journal 
> disks become owned by root and ceph failed to start.
> 
> 
> starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 
> /var/lib/ceph/osd/ceph-4/journal
> 2017-02-10 16:24:29.924126 7fd07ab40800 -1 
> filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal 
> /var/lib/ceph/osd/ceph-4/journal: (13) Permission denied
> 2017-02-10 16:24:29.924210 7fd07ab40800 -1 osd.4 0 OSD:init: unable to mount 
> object store
> 2017-02-10 16:24:29.924217 7fd07ab40800 -1 #033[0;31m ** ERROR: osd init 
> failed: (13) Permission denied#033[0m
> 
> 
> I fixed this issue by finding journal disks in /dev dir and chown to 
> ceph:ceph. I remember that I had a similar issue after I installed it for a 
> first time. Is it a bug ? or do I have to set some kind of udev rules for 
> this disks?
> 
> 
> FYI, I have this issue after every restart now.
> 
> 
> Kind regards,
> Piotr Dzionek
> 
> 
> ___ ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] - permission denied on journal after reboot

2017-02-13 Thread Piotr Dzionek

Hi,

I am running ceph Jewel 10.2.5 with separate journals - ssd disks. It 
runs pretty smoothly; however, I stumbled upon an issue after a system reboot. 
The journal disks become owned by root and ceph fails to start.


starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 
/var/lib/ceph/osd/ceph-4/journal
2017-02-10 16:24:29.924126 7fd07ab40800 -1 
filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal 
/var/lib/ceph/osd/ceph-4/journal: (13) Permission denied
2017-02-10 16:24:29.924210 7fd07ab40800 -1 osd.4 0 OSD:init: unable to 
mount object store
2017-02-10 16:24:29.924217 7fd07ab40800 -1 #033[0;31m ** ERROR: osd 
init failed: (13) Permission denied#033[0m


I fixed this issue by finding the journal disks in the /dev dir and chowning 
them to ceph:ceph. I remember that I had a similar issue after I installed it 
for the first time. Is it a bug, or do I have to set some kind of udev 
rules for these disks?


FYI, I have this issue after every restart now.

Kind regards,
Piotr Dzionek

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anyone using LVM or ZFS RAID1 for boot drives?

2017-02-13 Thread Willem Jan Withagen
On 13-2-2017 04:22, Alex Gorbachev wrote:
> Hello, with the preference for IT mode HBAs for OSDs and journals,
> what redundancy method do you guys use for the boot drives.  Some
> options beyond RAID1 at hardware level we can think of:
> 
> - LVM
> 
> - ZFS RAID1 mode

Since it is not quite Ceph, I'll take the liberty to answer with something
that is not quite Linux. :)

On FreeBSD I always use RAID1 boot disks; it is natively supported by
both the kernel and the installer. It fits really nicely with the upgrade
tools, allowing you to roll back if an upgrade did not work, or to boot
from one of the previously snapshotted boot disks.

NAME SIZE  ALLOC   FREE  EXPANDSZ   FRAGCAP  DEDUP  HEALTH
zfsroot  228G  2.57G   225G - 0% 1%  1.00x  ONLINE
  mirror 228G  2.57G   225G - 0% 1%
ada0p3  -  -  - -  -  -
ada1p3  -  -  - -  -  -

zfsroot   2.57G   218G19K  /zfsroot
zfsroot/ROOT  1.97G   218G19K  none
zfsroot/ROOT/default  1.97G   218G  1.97G  /
zfsroot/tmp   22.5K   218G  22.5K  /tmp
zfsroot/usr613M   218G19K  /usr
zfsroot/usr/compat  19K   218G19K  /usr/compat
zfsroot/usr/home34K   218G34K  /usr/home
zfsroot/usr/local  613M   218G   613M  /usr/local
zfsroot/usr/ports   19K   218G19K  /usr/ports
zfsroot/usr/src 19K   218G19K  /usr/src
zfsroot/var230K   218G19K  /var
zfsroot/var/audit   19K   218G19K  /var/audit
zfsroot/var/crash   19K   218G19K  /var/crash
zfsroot/var/log135K   218G   135K  /var/log
zfsroot/var/mail19K   218G19K  /var/mail
zfsroot/var/tmp 19K   218G19K  /var/tmp

Live maintenance is also a piece of cake with this.

If SSDs are used in a server, then I add a bit of cache. But as you can see,
the root stuff, including /usr and the like, is only 2.5GB. And the most used
part will be in the ZFS ARC, certainly if you did not skimp on RAM.
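
For reference, a minimal sketch of what such a mirrored root pool plus SSD cache
looks like when built by hand (device names are examples; the FreeBSD installer
normally creates the root pool and installs the bootcode for you):

# zpool create zfsroot mirror ada0p3 ada1p3
# zpool add zfsroot cache ada2p4
# zpool status zfsroot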

--WjW




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: mon is stuck in leveldb and costs nearly 100% cpu

2017-02-13 Thread kefu chai
On Mon, Feb 13, 2017 at 10:53 AM, Shinobu Kinjo  wrote:
> O.k, that's reasonable answer. Would you do on all hosts which the MON
> are running on:
>
>  #* ceph --admin-daemon /var/run/ceph/ceph-mon.`hostname -s`.asok
> config show | grep leveldb_log
>
> Anyway you can compact leveldb size with at runtime:
>
>  #* ceph tell mon.`hostname -s` compact
>
> And you should set in ceph.conf to prevent same issue from the next:
>
>  #* [mon]
>  #* mon compact on start = true
>
>
> On Mon, Feb 13, 2017 at 11:37 AM, Chenyehua  wrote:
>> Sorry, I made a mistake, the ceph version is actually 0.94.5

The latest hammer release is v0.94.9; perhaps you can give it a try? And
FWIW, hammer will
reach EOL this spring.

>>
>> -----Original Message-----
>> From: chenyehua 11692 (RD)
>> Sent: 13 February 2017 9:40
>> To: 'Shinobu Kinjo'
>> Cc: kc...@redhat.com; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] mon is stuck in leveldb and costs nearly 100% cpu
>>
>> My ceph version is 10.2.5
>>
>> -----Original Message-----
>> From: Shinobu Kinjo [mailto:ski...@redhat.com]
>> Sent: 12 February 2017 13:12
>> To: chenyehua 11692 (RD)
>> Cc: kc...@redhat.com; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] mon is stuck in leveldb and costs nearly 100% cpu
>>
>> Which Ceph version are you using?
>>
>> On Sat, Feb 11, 2017 at 5:02 PM, Chenyehua  wrote:
>>> Dear Mr Kefu Chai
>>>
>>> Sorry to disturb you.
>>>
>>> I meet a problem recently. In my ceph cluster ,health status has
>>> warning “store is getting too big!” for several days; and  ceph-mon
>>> costs nearly 100% cpu;
>>>
>>> Have you ever met this situation?
>>>
>>> Some detailed information are attached below:
>>>
>>>
>>>
>>> root@cvknode17:~# ceph -s
>>>
>>> cluster 04afba60-3a77-496c-b616-2ecb5e47e141
>>>
>>>  health HEALTH_WARN
>>>
>>> mon.cvknode17 store is getting too big! 34104 MB >= 15360
>>> MB
>>>
>>>  monmap e1: 3 mons at
>>> {cvknode15=172.16.51.15:6789/0,cvknode16=172.16.51.16:6789/0,cvknode17
>>> =172.16.51.17:6789/0}
>>>
>>> election epoch 862, quorum 0,1,2
>>> cvknode15,cvknode16,cvknode17
>>>
>>>  osdmap e196279: 347 osds: 347 up, 347 in
>>>
>>>   pgmap v5891025: 33272 pgs, 16 pools, 26944 GB data, 6822
>>> kobjects
>>>
>>> 65966 GB used, 579 TB / 644 TB avail
>>>
>>>33270 active+clean
>>>
>>>2 active+clean+scrubbing+deep
>>>
>>>   client io 840 kB/s rd, 739 kB/s wr, 35 op/s rd, 184 op/s wr
>>>
>>>
>>>
>>> root@cvknode17:~# top
>>>
>>> top - 15:19:28 up 23 days, 23:58,  6 users,  load average: 1.08, 1.40,
>>> 1.77
>>>
>>> Tasks: 346 total,   2 running, 342 sleeping,   0 stopped,   2 zombie
>>>
>>> Cpu(s):  8.1%us, 10.8%sy,  0.0%ni, 69.0%id,  9.5%wa,  0.0%hi,  2.5%si,
>>> 0.0%st
>>>
>>> Mem:  65384424k total, 58102880k used,  7281544k free,   240720k buffers
>>>
>>> Swap: 2100k total,   344944k used, 29654156k free, 24274272k cached
>>>
>>>
>>>
>>> PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>>>
>>>   24407 root  20   0 17.3g  12g  10m S   98 20.2   8420:11 ceph-mon
>>>
>>>
>>>
>>> root@cvknode17:~# top -Hp 24407
>>>
>>> top - 15:19:49 up 23 days, 23:59,  6 users,  load average: 1.12, 1.39,
>>> 1.76
>>>
>>> Tasks:  17 total,   1 running,  16 sleeping,   0 stopped,   0 zombie
>>>
>>> Cpu(s):  8.1%us, 10.8%sy,  0.0%ni, 69.0%id,  9.5%wa,  0.0%hi,  2.5%si,
>>> 0.0%st
>>>
>>> Mem:  65384424k total, 58104868k used,  7279556k free,   240744k buffers
>>>
>>> Swap: 2100k total,   344944k used, 29654156k free, 24271188k cached
>>>
>>>
>>>
>>> PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>>>
>>>   25931 root  20   0 17.3g  12g   9m R   98 20.2   7957:37 ceph-mon
>>>
>>>   24514 root  20   0 17.3g  12g   9m S2 20.2   3:06.75 ceph-mon
>>>
>>>   25932 root  20   0 17.3g  12g   9m S2 20.2   1:07.82 ceph-mon
>>>
>>>   24407 root  20   0 17.3g  12g   9m S0 20.2   0:00.67 ceph-mon
>>>
>>>   24508 root  20   0 17.3g  12g   9m S0 20.2  15:50.24 ceph-mon
>>>
>>>   24513 root  20   0 17.3g  12g   9m S0 20.2   0:07.88 ceph-mon
>>>
>>>   24534 root  20   0 17.3g  12g   9m S0 20.2 196:33.85 ceph-mon
>>>
>>>   24535 root  20   0 17.3g  12g   9m S0 20.2   0:00.01 ceph-mon
>>>
>>>   25929 root  20   0 17.3g  12g   9m S0 20.2   3:06.09 ceph-mon
>>>
>>>   25930 root  20   0 17.3g  12g   9m S0 20.2   8:12.58 ceph-mon
>>>
>>>   25933 root  20   0 17.3g  12g   9m S0 20.2   4:42.22 ceph-mon
>>>
>>>   25934 root  20   0 17.3g  12g   9m S0 20.2  40:53.27 ceph-mon
>>>
>>>   25935 root  20   0 17.3g  12g   9m S0 20.2   0:04.84 ceph-mon
>>>
>>>   25936 root  20   0 17.3g  12g   9m S0 20.2   0:00.01 ceph-mon
>>>
>>>   25980 root  20   0 17.3g  12g   9m S0 20.2   0:06.65 ceph-mon
>>>
>>>   25986 root  20   0 17.3g  12g   9m S0 20.2  48:26.77 ceph-mon
>>>
>>>   55738 root  20   0 17.3g  12g   9m S0 20.2   0:09.06 ceph-mon
>>>
>>>
>>>
>>>
>>>
>>> Thread 20 (Thread