Re: [ceph-users] KVM / Ceph performance problems

2016-11-23 Thread Peter Maloney
Are you using virtio_scsi? I found it to be much faster on Ceph in fio
benchmarks (and it also supports trim/discard).

https://pve.proxmox.com/wiki/Qemu_discard
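For example, on Proxmox a disk can be switched to the virtio-scsi controller
with discard enabled; the VM ID, storage and disk names below are only
illustrative:

qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 ceph-rbd:vm-100-disk-1,discard=on,cache=writeback

The guest then still needs to mount its filesystems with the discard option,
or run fstrim, for the trims to actually reach Ceph.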

On 11/23/16 07:53, M. Piscaer wrote:
> Hi,
>
> I have a little performance problem with KVM and Ceph.
>
> I'm using Proxmox 4.3-10/7230e60f, with KVM version
> pve-qemu-kvm_2.7.0-8. Ceph is on version jewel 10.2.3 on both the
> cluster and the client (ceph-common).
>
> The systems are connected to the network via a 4x bond with a total
> of 4 Gb/s.
>
> Within a guest:
> - when I do a write I get about 10 MB/s;
> - also when I try to do a write within the guest, but directly to
> ceph, I get the same speed;
> - but when I mount a ceph object on the Proxmox host I get about 110 MB/s.
>
> The guest is connected to interface vmbr160 → bond0.160 → bond0.
>
> This bridge vmbr160 has an IP address in the same subnet as the ceph
> cluster, with an MTU of 9000.
>
> The KVM block device is a virtio device.
>
> What can I do to solve this problem?
>
> Kind regards,
>
> Michiel Piscaer
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM / Ceph performance problems

2016-11-23 Thread M. Piscaer
Hi,

Thank you for your help.

After changing these settings the Linux guest got an increase in speed.
The FreeNAS guest still has a write speed of 10 MB/s.

The disk driver is virtio and has a write-back cache.

What am I missing?

Kind regards,

Michiel Piscaer

On 23-11-16 08:05, Оралов, Алексей С. wrote:
> Hello, Michiel
> 
> Use the HDD driver "virtio" and cache "Write back".
> 
> And also, on the Proxmox node, add the Ceph client configuration:
> 
> /etc/ceph/ceph.conf
> 
> [client]
> #admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> rbd cache = true
> rbd_cache_writethrough_until_flush = true
> rbd_readahead_disable_after_bytes=0
> rbd_default_format = 2
> 
> #Tuning options
> #rbd_cache_size = 67108864  #64M
> #rbd_cache_max_dirty = 50331648  #48M
> #rbd_cache_target_dirty = 33554432  #32M
> #rbd_cache_max_dirty_age = 2
> #rbd_op_threads = 10
> #rbd_readahead_trigger_requests = 10
> 
> 
> On 23.11.2016 9:53, M. Piscaer wrote:
>> Hi,
>>
>> I have a little performance problem with KVM and Ceph.
>>
>> I'm using Proxmox 4.3-10/7230e60f, with KVM version
>> pve-qemu-kvm_2.7.0-8. Ceph is on version jewel 10.2.3 on both the
>> cluster and the client (ceph-common).
>>
>> The systems are connected to the network via a 4x bond with a total
>> of 4 Gb/s.
>>
>> Within a guest:
>> - when I do a write I get about 10 MB/s;
>> - also when I try to do a write within the guest, but directly to
>> ceph, I get the same speed;
>> - but when I mount a ceph object on the Proxmox host I get about 110 MB/s.
>>
>> The guest is connected to interface vmbr160 → bond0.160 → bond0.
>>
>> This bridge vmbr160 has an IP address in the same subnet as the ceph
>> cluster, with an MTU of 9000.
>>
>> The KVM block device is a virtio device.
>>
>> What can I do to solve this problem?
>>
>> Kind regards,
>>
>> Michiel Piscaer
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?

2016-11-23 Thread JiaJia Zhong
Mike,
if you run mount.ceph with the "-v" option, you may get "ceph: Unknown mount
option rsize". You can actually ignore this; the rsize and rasize options will
both be passed through to the mount syscall.


I believe that you already have cephfs mounted successfully;
run "mount" in a terminal to check the actual mount options in mtab.


-- Original --
From:  "Mike Miller";
Date:  Wed, Nov 23, 2016 02:38 PM
To:  "Eric Eastman"; 
Cc:  "Ceph Users"; 
Subject:  Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?

 
Hi,

did some testing with multithreaded access and dd; performance scales as it
should.

Any ideas to improve single-threaded read performance further would be
highly appreciated. Some of our use cases require reading
large files from a single thread.

I have tried changing the readahead on the kernel client cephfs mount 
too, rsize and rasize.

mount.ceph ... -o name=cephfs,secretfile=secret.key,rsize=67108864

Doing this on kernel 4.5.2 gives the error message:
"ceph: Unknown mount option rsize"
or unknown rasize.

Can someone explain to me how I can experiment with readahead on cephfs?

Mike

On 11/21/16 12:33 PM, Eric Eastman wrote:
> Have you looked at your file layout?
>
> On a test cluster running 10.2.3 I created a 5GB file and then looked
> at the layout:
>
> # ls -l test.dat
>   -rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
> # getfattr -n ceph.file.layout test.dat
>   # file: test.dat
>   ceph.file.layout="stripe_unit=4194304 stripe_count=1
> object_size=4194304 pool=cephfs_data"
>
> From what I understand with this layout you are reading 4MB of data
> from 1 OSD at a time so I think you are seeing the overall speed of a
> single SATA drive.  I do not think increasing your MON/MDS links to
> 10Gb will help, nor for a single file read will it help by going to
> SSD for the metadata.
>
> To test this, you may want to try creating 10 x 50GB files, and then
> read them in parallel and see if your overall throughput increases.
> If so, take a look at the layout parameters and see if you can change
> the file layout to get more parallelization.
>
> https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
> https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst
>
> Regards,
> Eric
>
> On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller  wrote:
>> Hi,
>>
>> reading a big file 50 GB (tried more too)
>>
>> dd if=bigfile of=/dev/zero bs=4M
>>
>> in a cluster with 112 SATA disks in 10 osd (6272 pgs, replication 3) gives
>> me only about *122 MB/s* read speed in single thread. Scrubbing turned off
>> during measurement.
>>
>> I have been searching for possible bottlenecks. The network is not the
>> problem, the machine running dd is connected to the cluster public network
>> with a 20 GBASE-T bond. osd dual network: cluster public 10 GBASE-T, private
>> 10 GBASE-T.
>>
>> The osd SATA disks are utilized only up until about 10% or 20%, not more
>> than that. CPUs on osd idle too. CPUs on mon idle, mds usage about 1.0 (1
>> core is used on this 6-core machine). mon and mds connected with only 1 GbE
>> (I would expect some latency from that, but no bandwidth issues; in fact
>> network bandwidth is about 20 Mbit max).
>>
>> If I read a file with 50 GB, then clear the cache on the reading machine
>> (but not the osd caches), I get much better reading performance of about
>> *620 MB/s*. That seems logical to me as much (most) of the data is still in
>> the osd cache buffers. But still the read performance is not super
>> considered that the reading machine is connected to the cluster with a 20
>> Gbit/s bond.
>>
>> How can I improve? I am not really sure, but from my understanding 2
>> possible bottlenecks come to mind:
>>
>> 1) 1 GbE connection to mon / mds
>>
>> Is this the reason why reads are slow and osd disks are not hammered by read
>> requests and therewith fully utilized?
>>
>> 2) Move metadata to SSD
>>
>> Currently, cephfs_metadata is on the same pool as the data on the spinning
>> SATA disks. Is this the bottleneck? Is the move of metadata to SSD a
>> solution?
>>
>> Or is it both?
>>
>> Your experience and insight are highly appreciated.
>>
>> Thanks,
>>
>> Mike
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph strange issue after adding a cache OSD.

2016-11-23 Thread Nick Fisk
Hi Daznis,

I'm not sure how much help I can be, but I will try my best.

I think the post-split stats error is probably benign, although I think this
suggests you also increased the number of PGs in your
cache pool? If so, did you do this before or after you added the extra OSDs?
This may have been the cause.

On to the actual assert: this looks like it's part of the code which trims the
tiering hit sets. I don't understand why it's
crashing out, but it must be related to an invalid or missing hitset, I would
imagine.

https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485

The only thing I could think of from looking at the code is that the
function loops through all hitsets that are above the max
number (hit_set_count). I wonder if setting this number higher would mean it
won't try and trim any hitsets and let things recover?
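For reference, that could be tried on the fly with something like the
following (pool name and value are illustrative, and the disclaimer below
applies just as much to this):

ceph osd pool set <cache-pool> hit_set_count 32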

DISCLAIMER
This is a hunch, it might not work or could possibly even make things worse. 
Otherwise wait for someone who has a better idea to
comment.

Nick



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Daznis
> Sent: 23 November 2016 05:57
> To: ceph-users 
> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
> 
> Hello,
> 
> 
> The story goes like this.
> I have added another 3 drives to the caching layer. OSDs were added to the
> crush map one by one after each successful rebalance. When I added the last
> OSD and went away for about an hour, I noticed that it still had not finished
> rebalancing. Further investigation showed me that one of the older cache SSDs
> was restarting like crazy before fully booting. So I shut it down and waited
> for a rebalance without that OSD. Less than an hour later I had another 2 OSDs
> restarting like crazy. I tried running scrubs on the PGs the logs asked me to,
> but that did not help. I'm currently stuck with "8 scrub errors" and a
> completely dead cluster.
> 
> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split) stats; 
> must scrub before tier agent can activate
> 
> 
> I need help with OSD from crashing. Crash log:
>  0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
> osd/ReplicatedPG.cc: In function 'void
> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
> 
>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0xbde2c5]
>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
> int)+0x75f) [0x87e89f]
>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a) [0x8a11aa]
>  5: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
> ThreadPool::TPHandle&)+0x68a) [0x83c37a]
>  6: (OSD::dequeue_op(boost::intrusive_ptr,
> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405) [0x69af05]
>  7: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x333) [0x69b473]
>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
> [0xbcd9cf]
>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
>  10: (()+0x7dc5) [0x7f93b9df4dc5]
>  11: (clone()+0x6d) [0x7f93b88d5ced]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
> 
> 
> I have tried looking with  full debug enabled, but those logs didn't help me 
> much. I have tried to evict the cache layer, but some
> objects are stuck and can't be removed. Any suggestions would be greatly 
> appreciated.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] export-diff behavior if an initial snapshot is NOT specified

2016-11-23 Thread Zhongyan Gu
Let me make the issue more clear.
Suppose I cloned image A from a parent image and create snap1 for image A
and  then make some change of image A.
If I did the rbd export-diff @snap1. how should I prepare the existing
image B to make sure it  will be exactly same with image A@snap1 after
import-diff against this image B.
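For readers following along, the operations being discussed look roughly like
this (pool, image and snapshot names are illustrative, and whether the last
step leaves B identical to A@snap1 is exactly the open question):

rbd clone pool/parent@base pool/imageA      # image A is a clone of a parent snapshot
rbd snap create pool/imageA@snap1           # snapshot A, then keep writing to A
rbd export-diff pool/imageA@snap1 - | rbd import-diff - pool/imageB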

Thanks,
Zhongyan


On Wed, Nov 23, 2016 at 11:34 AM, Zhongyan Gu  wrote:

> Thanks Jason, very clear explanation.
> However, I found some strange behavior when export-diff on a cloned image,
> not sure it is a bug on calc_snap_set_diff().
> The test is,
> Image A is cloned from a parent image. then create snap1 for image A.
> The content of export-diff A@snap1 will be changed when update image A.
> Only after image A has no overlap with parent, the content of export-diff
> A@snap1 is stabled, which is almost zero.
> I don't think it is a designed behavior. export-diff A@snap1 should
> always get a stable output no matter image A is cloned or not.
>
> Please correct me if anything wrong.
>
> Thanks,
> Zhongyan
>
>
>
>
> On Tue, Nov 22, 2016 at 10:31 PM, Jason Dillaman 
> wrote:
>
>> On Tue, Nov 22, 2016 at 5:31 AM, Zhongyan Gu 
>> wrote:
>> > So if initial snapshot is NOT specified, then:
>> > rbd export-diff image@snap1 will diff all data to snap1. this cmd
>> equals to
>> > :
>> > rbd export image@snap1. Is my understand right or not??
>>
>>
>> While they will both export all data associated w/ image@snap1, the
>> "export" command will generate a raw, non-sparse dump of the full
>> image whereas "export-diff" will export only sections of the image
>> that contain data. The file generated from "export" can be used with
>> the "import" command to create a new image, whereas the file generated
>> from "export-diff" can only be used with "import-diff" against an
>> existing image.
>>
>> --
>> Jason
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph strange issue after adding a cache OSD.

2016-11-23 Thread Daznis
Hi,


Looks like one of my colleagues increased the PG number before it
finished. I was flushing the whole cache tier and it's currently stuck
on ~80 GB of data, because of the OSD crashes. I will look into the
hitset counts and check what can be done. Will provide an update if I
find anything or fix the issue.


On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk  wrote:
> Hi Daznis,
>
> I'm not sure how much help I can be, but I will try my best.
>
> I think the post-split stats error is probably benign, although I think this 
> suggests you also increased the number of PG's in your
> cache pool? If so did you do this before or after you added the extra OSD's?  
> This may have been the cause.
>
> On to the actual assert, this looks like it's part of the code which trims 
> the tiering hit set's. I don't understand why its
> crashing out, but it must be related to an invalid or missing hitset I would 
> imagine.
>
> https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485
>
> The only thing I could think of from looking at in the code is that the 
> function loops through all hitsets that are above the max
> number (hit_set_count). I wonder if setting this number higher would mean it 
> won't try and trim any hitsets and let things recover?
>
> DISCLAIMER
> This is a hunch, it might not work or could possibly even make things worse. 
> Otherwise wait for someone who has a better idea to
> comment.
>
> Nick
>
>
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Daznis
>> Sent: 23 November 2016 05:57
>> To: ceph-users 
>> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
>>
>> Hello,
>>
>>
>> The story goes like this.
>> I have added another 3 drives to the caching layer. OSDs were added to crush 
>> map one by one after each successful rebalance. When
> I
>> added the last OSD and went away for about an hour I noticed that it's still 
>> not finished rebalancing. Further investigation
> showed me
>> that it one of the older cache SSD was restarting like crazy before full 
>> boot. So I shut it down and waited for a rebalance
> without that
>> OSD. Less than an hour later I had another 2 OSD restarting like crazy. I 
>> tried running scrubs on the PG's logs asked me to, but
> that did
>> not help. I'm currently stuck with " 8 scrub errors" and a complete dead 
>> cluster.
>>
>> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split) stats; 
>> must scrub before tier agent can activate
>>
>>
>> I need help with OSD from crashing. Crash log:
>>  0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
>> osd/ReplicatedPG.cc: In function 'void
>> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
>> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
>>
>>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0xbde2c5]
>>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
>> int)+0x75f) [0x87e89f]
>>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
>>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a) [0x8a11aa]
>>  5: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
>> ThreadPool::TPHandle&)+0x68a) [0x83c37a]
>>  6: (OSD::dequeue_op(boost::intrusive_ptr,
>> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405) [0x69af05]
>>  7: (OSD::ShardedOpWQ::_process(unsigned int,
>> ceph::heartbeat_handle_d*)+0x333) [0x69b473]
>>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
>> [0xbcd9cf]
>>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
>>  10: (()+0x7dc5) [0x7f93b9df4dc5]
>>  11: (clone()+0x6d) [0x7f93b88d5ced]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>>
>>
>> I have tried looking with  full debug enabled, but those logs didn't help me 
>> much. I have tried to evict the cache layer, but some
>> objects are stuck and can't be removed. Any suggestions would be greatly 
>> appreciated.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph strange issue after adding a cache OSD.

2016-11-23 Thread Nick Fisk
> -Original Message-
> From: Daznis [mailto:daz...@gmail.com]
> Sent: 23 November 2016 10:17
> To: n...@fisk.me.uk
> Cc: ceph-users 
> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
> 
> Hi,
> 
> 
> Looks like one of my colleagues increased the PG number before it finished. I 
> was flushing the whole cache tier and it's currently stuck
> on ~80 GB of data, because of the OSD crashes. I will look into the hitset 
> counts and check what can be done. Will provide an update if
> I find anything or fix the issue.

So I'm guessing that when the PG split, the stats/hit_sets were not how the OSD
expects them to be, which causes the crash. I would expect this has been caused
by the PG splitting rather than by introducing extra OSDs. If you manage to get
things stable by bumping up the hitset count, then you probably want to try and
do a scrub to clean up the stats, which may then stop this happening
when the hitset comes round to being trimmed again.
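For reference, the PG called out in the log can be scrubbed directly, e.g.:

ceph pg scrub 15.8d
ceph pg deep-scrub 15.8d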

> 
> 
> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk  wrote:
> > Hi Daznis,
> >
> > I'm not sure how much help I can be, but I will try my best.
> >
> > I think the post-split stats error is probably benign, although I
> > think this suggests you also increased the number of PG's in your cache 
> > pool? If so did you do this before or after you added the
> extra OSD's?  This may have been the cause.
> >
> > On to the actual assert, this looks like it's part of the code which
> > trims the tiering hit set's. I don't understand why its crashing out, but 
> > it must be related to an invalid or missing hitset I would
> imagine.
> >
> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L104
> > 85
> >
> > The only thing I could think of from looking at in the code is that
> > the function loops through all hitsets that are above the max number 
> > (hit_set_count). I wonder if setting this number higher would
> mean it won't try and trim any hitsets and let things recover?
> >
> > DISCLAIMER
> > This is a hunch, it might not work or could possibly even make things
> > worse. Otherwise wait for someone who has a better idea to comment.
> >
> > Nick
> >
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Daznis
> >> Sent: 23 November 2016 05:57
> >> To: ceph-users 
> >> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
> >>
> >> Hello,
> >>
> >>
> >> The story goes like this.
> >> I have added another 3 drives to the caching layer. OSDs were added
> >> to crush map one by one after each successful rebalance. When
> > I
> >> added the last OSD and went away for about an hour I noticed that
> >> it's still not finished rebalancing. Further investigation
> > showed me
> >> that it one of the older cache SSD was restarting like crazy before
> >> full boot. So I shut it down and waited for a rebalance
> > without that
> >> OSD. Less than an hour later I had another 2 OSD restarting like
> >> crazy. I tried running scrubs on the PG's logs asked me to, but
> > that did
> >> not help. I'm currently stuck with " 8 scrub errors" and a complete dead 
> >> cluster.
> >>
> >> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split)
> >> stats; must scrub before tier agent can activate
> >>
> >>
> >> I need help with OSD from crashing. Crash log:
> >>  0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
> >> osd/ReplicatedPG.cc: In function 'void
> >> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
> >> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
> >> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
> >>
> >>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x85) [0xbde2c5]
> >>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
> >> int)+0x75f) [0x87e89f]
> >>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
> >>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
> >> [0x8a11aa]
> >>  5: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
> >> ThreadPool::TPHandle&)+0x68a) [0x83c37a]
> >>  6: (OSD::dequeue_op(boost::intrusive_ptr,
> >> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405)
> >> [0x69af05]
> >>  7: (OSD::ShardedOpWQ::_process(unsigned int,
> >> ceph::heartbeat_handle_d*)+0x333) [0x69b473]
> >>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f)
> >> [0xbcd9cf]
> >>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
> >>  10: (()+0x7dc5) [0x7f93b9df4dc5]
> >>  11: (clone()+0x6d) [0x7f93b88d5ced]
> >>  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
> >> to interpret this.
> >>
> >>
> >> I have tried looking with  full debug enabled, but those logs didn't
> >> help me much. I have tried to evict the cache layer, but some objects are 
> >> stuck and can't be removed. Any suggestions would be
> greatly appreciated.
> >> __

Re: [ceph-users] deep-scrubbing has large impact on performance

2016-11-23 Thread Nick Fisk
Thanks for the tip Robert, much appreciated.
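For anyone else following the thread, the combination being discussed goes
into ceph.conf roughly like this (the OSDs need a restart for the queue change
to take effect; treat this as a sketch rather than a recommendation):

[osd]
osd op queue = wpq
osd op queue cut off = high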

> -Original Message-
> From: Robert LeBlanc [mailto:rob...@leblancnet.us]
> Sent: 23 November 2016 00:54
> To: Eugen Block 
> Cc: Nick Fisk ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] deep-scrubbing has large impact on performance
> 
> If you use wpq, I recommend also setting "osd_op_queue_cut_off = high";
> otherwise replication ops are not weighted, which really reduces the
> benefit of wpq.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Nov 22, 2016 at 5:34 AM, Eugen Block  wrote:
> > Thank you!
> >
> >
> > Zitat von Nick Fisk :
> >
> >>> -Original Message-
> >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >>> Behalf Of Eugen Block
> >>> Sent: 22 November 2016 10:11
> >>> To: Nick Fisk 
> >>> Cc: ceph-users@lists.ceph.com
> >>> Subject: Re: [ceph-users] deep-scrubbing has large impact on
> >>> performance
> >>>
> >>> Thanks for the very quick answer!
> >>>
> >>> > If you are using Jewel
> >>>
> >>> We are still using Hammer (0.94.7), we wanted to upgrade to Jewel in
> >>> a couple of weeks, would you recommend to do it now?
> >>
> >>
> >> It's been fairly solid for me, but you might want to wait for the
> >> scrubbing hang bug to be fixed before upgrading. I think this might
> >> be fixed in the upcoming 10.2.4 release.
> >>
> >>>
> >>>
> >>> Zitat von Nick Fisk :
> >>>
> >>> >> -Original Message-
> >>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >>> >> Behalf Of Eugen Block
> >>> >> Sent: 22 November 2016 09:55
> >>> >> To: ceph-users@lists.ceph.com
> >>> >> Subject: [ceph-users] deep-scrubbing has large impact on
> >>> >> performance
> >>> >>
> >>> >> Hi list,
> >>> >>
> >>> >> I've been searching the mail archive and the web for some help. I
> >>> >> tried the things I found, but I can't see the effects. We use
> >>> > Ceph for
> >>> >> our Openstack environment.
> >>> >>
> >>> >> When our cluster (2 pools, each 4092 PGs, in 20 OSDs on 4 nodes,
> >>> >> 3
> >>> >> MONs) starts deep-scrubbing, it's impossible to work with the VMs.
> >>> >> Currently, the deep-scrubs happen to start on Monday, which is
> >>> >> unfortunate. I already plan to start the next deep-scrub on
> >>> > Saturday,
> >>> >> so it has no impact on our work days. But if I imagine we had a
> >>> >> large multi-datacenter, such performance breaks are not
> >>> > reasonable. So
> >>> >> I'm wondering how do you guys manage that?
> >>> >>
> >>> >> What I've tried so far:
> >>> >>
> >>> >> ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
> >>> >> ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
> >>> >> ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
> >>> >> ceph tell osd.* injectargs '--osd_scrub_begin_hour 0'
> >>> >> ceph tell osd.* injectargs '--osd_scrub_end_hour 7'
> >>> >>
> >>> >> And I also added these options to the ceph.conf.
> >>> >> To be able to work again, I had to set the nodeep-scrub option
> >>> >> and unset it when I left the office. Today, I see the cluster
> >>> >> deep- scrubbing again, but only one PG at a time, it seems that
> >>> >> now the default for osd_max_scrubs is working now and I don't see
> >>> >> major impacts yet.
> >>> >>
> >>> >> But is there something else I can do to reduce the performance impact?
> >>> >
> >>> > If you are using Jewel, the scrubing is now done in the client IO
> >>> > thread, so those disk thread options won't do anything. Instead
> >>> > there is a new priority setting, which seems to work for me, along
> >>> > with a few other settings.
> >>> >
> >>> > osd_scrub_priority = 1
> >>> > osd_scrub_sleep = .1
> >>> > osd_scrub_chunk_min = 1
> >>> > osd_scrub_chunk_max = 5
> >>> > osd_scrub_load_threshold = 5
> >>> >
> >>> > Also enabling the weighted priority queue can assist the new
> >>> > priority options
> >>> >
> >>> > osd_op_queue = wpq
> >>> >
> >>> >
> >>> >> I just found [1] and will have a look into it.
> >>> >>
> >>> >> [1] http://prob6.com/en/ceph-pg-deep-scrub-cron/
> >>> >>
> >>> >> Thanks!
> >>> >> Eugen
> >>> >>
> >>> >> --
> >>> >> Eugen Block voice   : +49-40-559 51 75
> >>> >> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> >>> >> Postfach 61 03 15
> >>> >> D-22423 Hamburg e-mail  : ebl...@nde.ag
> >>> >>
> >>> >>  Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >>> >>Sitz und Registergericht: Hamburg, HRB 90934
> >>> >>Vorstand: Jens-U. Mozdzen
> >>> >> USt-IdNr. DE 814 013 983
> >>> >>
> >>> >> ___
> >>> >> ceph-users mailing list
> >>> >> ceph-users@lists.ceph.com
> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>>
> >>>
> >>> --
> >>> Eugen Block voice   : +49-40-559 51 75
> >>> NDE Netzdesign u

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread Thomas Danan
Hi all,

Still not able to find any explanation to this issue.

I recently tested the network and I am seeing some retransmits in the
output of iperf, but overall the bandwidth during a 10-second test is around 7 to
8 Gbps.
I was not sure whether it was the test itself that was overloading the
network or whether my network switches were having an issue.
The switches have been checked and they show no congestion issues or other
errors.
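In case it helps, the retransmissions can also be watched at the host level
with the kernel TCP counters on the OSD nodes, for example:

netstat -s | grep -i retrans    # cumulative TCP retransmit counters
ss -ti                          # per-socket details, including retransmits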

I really don't know what to check or test; any idea is more than welcome …

Thomas

From: Thomas Danan
Sent: vendredi 18 novembre 2016 17:12
To: 'n...@fisk.me.uk'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Here are some logs. The system is in the IST time zone and I have filtered the logs down to
the last 2 hours, during which we can observe the issue.

In that particular case, issue is illustrated with the following OSDs

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, the first slow request message is detected at 16:18:

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4 MB object with a snapc context, but in my
environment I have no snapshots (actually they were all deleted). Also, I was
told those messages were not necessarily related to object replication to a
snapshot image.
Each time I have a slow request message it is formatted as described, with a 4 MB
object and a snapc context.

rados df is showing me that I have 4 cloned objects; I do not understand why.

15 minutes later the ops seem to be unblocked, after an "initiating reconnect" message:

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 
c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227 >> 
192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647 cs=5 l=0 
c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs.
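In case it helps, the in-flight and recently completed ops on the primary can
also be dumped via the admin socket, e.g. on the node hosting osd.607:

ceph daemon osd.607 dump_ops_in_flight
ceph daemon osd.607 dump_historic_ops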

Thanks for your help …

Thomas


From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: jeudi 17 novembre 2016 11:02
To: Thomas Danan; n...@fisk.me.uk; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

Do you also see anything about OSD’s going down in the Mon ceph.log file around 
that time?

480 seconds is probably far too long for a disk to be busy for. I'm wondering
if the OSD is either dying and respawning or if you are running out of some
type of system resource, e.g. TCP connections or something like that, which means
the OSDs can't communicate with each other.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 17 November 2016 08:59
To: n...@fisk.me.uk; 'Peter Maloney' 
mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi,

I have rechecked the pattern when slow requests are detected.

I have an example with the following OSDs (primary: 411, secondaries: 176, 594).
On the primary, slow requests were detected (waiting for subops from 176, 594) for 16
minutes:

2016-11-17 13:29:27.209754 7f001d414700 0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 480.477315 secs
2016-11-17 13:29:27.209777 7f001d414700 0 log_channel(cluster) log [WRN] : slow 
request 480.477315 seconds old, received at 2016-11-17 13:21:26.732303: 
osd_op(client.2407558.1:206455044 rbd_data.66ea12ae8944a.001acbbc 
[set-alloc-hint object_size 4194304 write_size 4194304,w

Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?

2016-11-23 Thread Mark Nelson

Hi Mike,

Single threaded synchronous sequential read is a difficult case.  You 
are even worse off in your specific situation:


a) you only have a single request in flight
b) you have many (slow) underlying devices
c) you don't expect data to already be in cache
d) you still suffer from any potential underlying file fragmentation

As Eric said, the fundamental problem here is that you're reading from a 
single OSD at a time.  For traditional 7200RPM SATA disks, your 
performance is in the right ballpark, if a bit lower than what the 
drives can natively do.  The most straightforward way to improve 
performance in this case is going to be to use faster OSD disks or have 
your application fetch multiple chunks of the file in advance.  If SSDs 
are out of the question, you could try RAID under the OSDs, but that's 
not really something we support at RH.


You could try doing reads much larger than the underlying object size 
and/or screwing with the client side readahead as you mentioned.  Also, 
if your application can use POSIX_FADV_SEQUENTIAL that will double the 
readahead used afaik.
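As a rough illustration of the first suggestion (the sizes are only examples):
issue much larger requests from the reading process, and mount with a larger
client readahead:

dd if=bigfile of=/dev/null bs=64M    # request sizes well above the 4MB object size
mount -t ceph mon:6789:/ /mnt/cephfs -o name=cephfs,secretfile=secret.key,rasize=134217728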


Mark

On 11/23/2016 12:38 AM, Mike Miller wrote:

Hi,

did some testing with multithreaded access and dd; performance scales as it
should.

Any ideas to improve single-threaded read performance further would be
highly appreciated. Some of our use cases require reading
large files from a single thread.

I have tried changing the readahead on the kernel client cephfs mount
too, rsize and rasize.

mount.ceph ... -o name=cephfs,secretfile=secret.key,rsize=67108864

Doing this on kernel 4.5.2 gives the error message:
"ceph: Unknown mount option rsize"
or unknown rasize.

Can someone explain to me how I can experiment with readahead on cephfs?

Mike

On 11/21/16 12:33 PM, Eric Eastman wrote:

Have you looked at your file layout?

On a test cluster running 10.2.3 I created a 5GB file and then looked
at the layout:

# ls -l test.dat
  -rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
  # file: test.dat
  ceph.file.layout="stripe_unit=4194304 stripe_count=1
object_size=4194304 pool=cephfs_data"

From what I understand with this layout you are reading 4MB of data
from 1 OSD at a time so I think you are seeing the overall speed of a
single SATA drive.  I do not think increasing your MON/MDS links to
10Gb will help, nor for a single file read will it help by going to
SSD for the metadata.

To test this, you may want to try creating 10 x 50GB files, and then
read them in parallel and see if your overall throughput increases.
If so, take a look at the layout parameters and see if you can change
the file layout to get more parallelization.

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst

Regards,
Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller 
wrote:

Hi,

reading a big file 50 GB (tried more too)

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks in 10 osd (6272 pgs, replication 3)
gives
me only about *122 MB/s* read speed in single thread. Scrubbing
turned off
during measurement.

I have been searching for possible bottlenecks. The network is not the
problem, the machine running dd is connected to the cluster public
network
with a 20 GBASE-T bond. osd dual network: cluster public 10 GBASE-T,
private
10 GBASE-T.

The osd SATA disks are utilized only up until about 10% or 20%, not more
than that. CPUs on osd idle too. CPUs on mon idle, mds usage about
1.0 (1
core is used on this 6-core machine). mon and mds connected with only
1 GbE
(I would expect some latency from that, but no bandwidth issues; in fact
network bandwidth is about 20 Mbit max).

If I read a file with 50 GB, then clear the cache on the reading machine
(but not the osd caches), I get much better reading performance of about
*620 MB/s*. That seems logical to me as much (most) of the data is
still in
the osd cache buffers. But still the read performance is not super
considered that the reading machine is connected to the cluster with
a 20
Gbit/s bond.

How can I improve? I am not really sure, but from my understanding 2
possible bottlenecks come to mind:

1) 1 GbE connection to mon / mds

Is this the reason why reads are slow and osd disks are not hammered
by read
requests and therewith fully utilized?

2) Move metadata to SSD

Currently, cephfs_metadata is on the same pool as the data on the
spinning
SATA disks. Is this the bottleneck? Is the move of metadata to SSD a
solution?

Or is it both?

Your experience and insight are highly appreciated.

Thanks,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
c

[ceph-users] ceph-mon running but i cant connect to cluster

2016-11-23 Thread pascal.bous...@cea.fr

Hello,

After upgrading to 0.94.9 from 0.94.7:

I can contact the ceph-mon daemon with:
ceph daemon mon.node-11 mon_status
which returns { ... "state": "probing", ... }

And when I try to check the ceph cluster with "ceph -s", there is no response, and after Ctrl+C
I get "Error connecting to cluster".

Before this upgrade the cluster worked fine in production.
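For reference, these are the checks run so far plus one more that may help
(stop the mon before extracting the monmap; the output path is illustrative):

ceph daemon mon.node-11 mon_status
ceph-mon -i node-11 --extract-monmap /tmp/monmap && monmaptool --print /tmp/monmap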


[global]
fsid = 7da0b7a9-baf3-4766-ab80-e98c70dc5f8a
mon_initial_members = node-11, node-12, node-21,node-22,node-31
mon_host = 100.15.18.171 ,100.15.18.172 ,100.15.18.175, 100.15.18.176, 
100.15.18.179
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 100.15.0.0/16
cluster network = 10.15.0.0/16
osd pool default size = 3  # Write an object 3 times
# default number of placement groups per pool
# 100 per OSD
# (100*12*7)/3
osd pool default pg num = 4096
osd pool default pgp num = 4096
debug mon = 10
debug ms = 1

[mon.node-11]
  host = node-11
  mon addr = 100.15.18.171:6789
[mon.1]
  host = node-12
  mon addr = 100.15.18.172:6789
[mon.2]
  host = node-21
  mon addr = 100.15.18.175:6789
[mon.3]
  host = node-22
  mon addr = 100.15.18.176:6789
[mon.4]
  host = node-31
  mon addr = 100.15.18.179:6789
[osd]
  osd mkfs type = xfs
  osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog
  osd mkfs options xfs = -f -i size=2048
  osd recovery max active = 1
  osd max backfills = 1
[osd.0]
host = node-11
[osd.1]
host = node-11
..

Tail ceph-mon.node-11.log

2016-11-23 12:25:50.494523 7feb10cd1700  5 
mon.node-11@0(probing) e5 waitlisting message 
auth(proto 0 27 bytes epoch 0) v1
2016-11-23 12:25:50.516225 7feb10cd1700 10 
mon.node-11@0(probing) e5 ms_handle_reset 
0x5249700 100.15.36.48:6804/7890
2016-11-23 12:25:50.569899 7feb0ac8d700  1 -- 100.15.18.171:6789/0 >> :/0 
pipe(0x58e8000 sd=43 :6789 s=0 pgs=0 cs=0 l=0 c=0x55a1e40).accept sd=43 
100.15.36.40:37051/0
2016-11-23 12:25:50.570052 7feb0ac8d700 10 
mon.node-11@0(probing) e5 ms_verify_authorizer 
100.15.36.40:6820/7307 osd protocol 0
2016-11-23 12:25:50.570362 7feb10cd1700  1 -- 100.15.18.171:6789/0 <== osd.144 
100.15.36.40:6820/7307 1  auth(proto 0 28 bytes epoch 0) v1  58+0+0 
(2228332661 0 0) 0x57bcf40 con 0x55a1e40
2016-11-23 12:25:50.570393 7feb10cd1700  5 
mon.node-11@0(probing) e5 waitlisting message 
auth(proto 0 28 bytes epoch 0) v1
2016-11-23 12:25:50.572652 7feb0dabb700  1 -- 100.15.18.171:6789/0 >> :/0 
pipe(0x567a000 sd=80 :6789 s=0 pgs=0 cs=0 l=0 c=0x55a0f20).accept sd=80 
100.15.18.175:56971/0
2016-11-23 12:25:50.572854 7feb0dabb700 10 
mon.node-11@0(probing) e5 ms_verify_authorizer 
100.15.18.175:6808/7432 osd protocol 0
2016-11-23 12:25:50.573223 7feb10cd1700  1 -- 100.15.18.171:6789/0 <== osd.57 
100.15.18.175:6808/7432 1  auth(proto 0 27 bytes epoch 0) v1  57+0+0 
(1006143703 0 0) 0x57bd180 con 0x55a0f20
2016-11-23 12:25:50.573257 7feb10cd1700  5 
mon.node-11@0(probing) e5 waitlisting message 
auth(proto 0 27 bytes epoch 0) v1


Regards

Pascal BOUSTIE
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph strange issue after adding a cache OSD.

2016-11-23 Thread Daznis
Thank you. That helped quite a lot. Now I'm just stuck with one OSD
crashing with:

osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,
spg_t, epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time
2016-11-23 13:42:43.27
8539
osd/PG.cc: 2911: FAILED assert(r > 0)

 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbde2c5]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x8ba) [0x7cf4da]
 3: (OSD::load_pgs()+0x9ef) [0x6bd31f]
 4: (OSD::init()+0x181a) [0x6c0e8a]
 5: (main()+0x29dd) [0x6484bd]
 6: (__libc_start_main()+0xf5) [0x7f36b916bb15]
 7: /usr/bin/ceph-osd() [0x661ea9]

On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk  wrote:
>> -Original Message-
>> From: Daznis [mailto:daz...@gmail.com]
>> Sent: 23 November 2016 10:17
>> To: n...@fisk.me.uk
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
>>
>> Hi,
>>
>>
>> Looks like one of my colleagues increased the PG number before it finished. 
>> I was flushing the whole cache tier and it's currently stuck
>> on ~80 GB of data, because of the OSD crashes. I will look into the hitset 
>> counts and check what can be done. Will provide an update if
>> I find anything or fix the issue.
>
> So I'm guessing when the PG split, the stats/hit_sets are not how the OSD is 
> expecting them to be and causes the crash. I would expect this has been 
> caused by the PG splitting rather than introducing extra OSD's. If you manage 
> to get things stable by bumping up the hitset count, then you probably want 
> to try and do a scrub to try and clean up the stats, which may then stop this 
> happening when the hitset comes round to being trimmed again.
>
>>
>>
>> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk  wrote:
>> > Hi Daznis,
>> >
>> > I'm not sure how much help I can be, but I will try my best.
>> >
>> > I think the post-split stats error is probably benign, although I
>> > think this suggests you also increased the number of PG's in your cache 
>> > pool? If so did you do this before or after you added the
>> extra OSD's?  This may have been the cause.
>> >
>> > On to the actual assert, this looks like it's part of the code which
>> > trims the tiering hit set's. I don't understand why its crashing out, but 
>> > it must be related to an invalid or missing hitset I would
>> imagine.
>> >
>> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L104
>> > 85
>> >
>> > The only thing I could think of from looking at in the code is that
>> > the function loops through all hitsets that are above the max number 
>> > (hit_set_count). I wonder if setting this number higher would
>> mean it won't try and trim any hitsets and let things recover?
>> >
>> > DISCLAIMER
>> > This is a hunch, it might not work or could possibly even make things
>> > worse. Otherwise wait for someone who has a better idea to comment.
>> >
>> > Nick
>> >
>> >
>> >
>> >> -Original Message-
>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> >> Of Daznis
>> >> Sent: 23 November 2016 05:57
>> >> To: ceph-users 
>> >> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
>> >>
>> >> Hello,
>> >>
>> >>
>> >> The story goes like this.
>> >> I have added another 3 drives to the caching layer. OSDs were added
>> >> to crush map one by one after each successful rebalance. When
>> > I
>> >> added the last OSD and went away for about an hour I noticed that
>> >> it's still not finished rebalancing. Further investigation
>> > showed me
>> >> that it one of the older cache SSD was restarting like crazy before
>> >> full boot. So I shut it down and waited for a rebalance
>> > without that
>> >> OSD. Less than an hour later I had another 2 OSD restarting like
>> >> crazy. I tried running scrubs on the PG's logs asked me to, but
>> > that did
>> >> not help. I'm currently stuck with " 8 scrub errors" and a complete dead 
>> >> cluster.
>> >>
>> >> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split)
>> >> stats; must scrub before tier agent can activate
>> >>
>> >>
>> >> I need help with OSD from crashing. Crash log:
>> >>  0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
>> >> osd/ReplicatedPG.cc: In function 'void
>> >> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>> >> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
>> >> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
>> >>
>> >>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> const*)+0x85) [0xbde2c5]
>> >>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
>> >> int)+0x75f) [0x87e89f]
>> >>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
>> >>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
>> >> [0x8a11aa]
>> >>  5: (ReplicatedPG::do_request(std::tr1::sh

[ceph-users] ERROR: flush_read_list(): d->client_c->handle_data() returned -5

2016-11-23 Thread Riederer, Michael
Hello,

we have 4 ceph radosgws behind a haproxy running ceph version 10.2.3

ceph.conf:
[global]
osd_pool_default_pgp_num = 4096
auth_service_required = cephx
mon_initial_members = 
ceph-203-1-public,ceph-203-2-public,ceph-203-3-public,ceph-203-4-public,ceph-203-7-public
fsid = 69876022-f6fb-4eef-af47-5527cfa1e33a
cluster_network = 10.65.204.0/24
auth_supported = cephx
auth_cluster_required = cephx
mon_host = 
10.65.203.17:6789,10.65.203.18:6789,10.65.203.19:6789,10.65.203.95:6789,10.65.203.98:6789
auth_client_required = cephx
public_network = 10.65.203.0/24

[client.radosgw.ceph-203-rgw-1]
host = ceph-203-rgw-1
keyring = /etc/ceph/ceph.client.radosgw.ceph-203-rgw-1.keyring
rgw_frontends = civetweb port=80
rgw dns name = ceph-203-rgw-1.mm.br.de
rgw print continue = False

Can someone help me find the cause of this error:

2016-11-23 13:53:29.801313 7ff1487f0700  1 == starting new request 
req=0x7ff1487ea710 =
2016-11-23 13:53:29.802850 7ff1487f0700  1 == req done req=0x7ff1487ea710 
op status=0 http_status=200 ==
2016-11-23 13:53:29.802901 7ff1487f0700  1 civetweb: 0x7ff244002e90: 
10.65.163.49 - - [23/Nov/2016:13:53:29 +0100] "HEAD 
/mir-Live/3c14cda9-1f5c-49df-92d2-d8cc5ca03472_P.mp4 HTTP/1.1" 200 0 - 
aws-sdk-java/1.11.14 Linux/2.6.32-573.18.1.el6.x86_64 
OpenJDK_64-Bit_Server_VM/25.71-b15/1.8.0_71
2016-11-23 13:53:30.367801 7ff139fd3700  1 == starting new request 
req=0x7ff139fcd710 =
2016-11-23 13:53:30.382091 7ff139fd3700  0 ERROR: flush_read_list(): 
d->client_c->handle_data() returned -5
2016-11-23 13:53:30.382328 7ff139fd3700  0 WARNING: set_req_state_err err_no=5 
resorting to 500
2016-11-23 13:53:30.382389 7ff139fd3700  0 ERROR: s->cio->send_content_length() 
returned err=-5
2016-11-23 13:53:30.382394 7ff139fd3700  0 ERROR: s->cio->print() returned 
err=-5
2016-11-23 13:53:30.382396 7ff139fd3700  0 ERROR: STREAM_IO(s)->print() 
returned err=-5
2016-11-23 13:53:30.382414 7ff139fd3700  0 ERROR: 
STREAM_IO(s)->complete_header() returned err=-5
2016-11-23 13:53:30.382459 7ff139fd3700  1 == req done req=0x7ff139fcd710 
op status=-5 http_status=500 ==
2016-11-23 13:53:30.382541 7ff139fd3700  1 civetweb: 0x7ff2040008e0: 
10.65.163.49 - - [23/Nov/2016:13:53:30 +0100] "GET 
/mir-Live/3c14cda9-1f5c-49df-92d2-d8cc5ca03472_2.mp4 HTTP/1.1" 500 0 - 
aws-sdk-java/1.11.14 Linux/2.6.32-573.18.1.el6.x86_64 
OpenJDK_64-Bit_Server_VM/25.71-b15/1.8.0_71

Regards
Michael
--
Bayerischer Rundfunk; Rundfunkplatz 1; 80335 München
Telefon: +49 89 590001; E-Mail: i...@br.de; Website: http://www.BR.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph strange issue after adding a cache OSD.

2016-11-23 Thread Nick Fisk
Sorry, I'm afraid I'm out of ideas about that one; that error doesn't mean very
much to me. The code suggests the OSD is trying to get an attr from the
disk/filesystem, but for some reason it doesn't like what it finds. You could maybe
whack the debug logging for OSD and filestore up to max and try to see which
PG/file is accessed just before the crash, but I'm not sure what the fix would
be, even if you manage to locate the dodgy PG.
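A sketch of that, with the crashing OSD's id substituted in; running it in the
foreground with debug raised should show the last PG/object touched before the
assert:

ceph-osd -i <id> -f --debug-osd 20/20 --debug-filestore 20/20 --debug-ms 1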

Does the cluster have all PGs recovered now? Unless anyone else can comment,
you might be best off removing/wiping and then re-adding the OSD.

> -Original Message-
> From: Daznis [mailto:daz...@gmail.com]
> Sent: 23 November 2016 12:55
> To: Nick Fisk 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
> 
> Thank you. That helped quite a lot. Now I'm just stuck with one OSD crashing 
> with:
> 
> osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, 
> epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time
> 2016-11-23 13:42:43.27
> 8539
> osd/PG.cc: 2911: FAILED assert(r > 0)
> 
>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0xbde2c5]
>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
> ceph::buffer::list*)+0x8ba) [0x7cf4da]
>  3: (OSD::load_pgs()+0x9ef) [0x6bd31f]
>  4: (OSD::init()+0x181a) [0x6c0e8a]
>  5: (main()+0x29dd) [0x6484bd]
>  6: (__libc_start_main()+0xf5) [0x7f36b916bb15]
>  7: /usr/bin/ceph-osd() [0x661ea9]
> 
> On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: Daznis [mailto:daz...@gmail.com]
> >> Sent: 23 November 2016 10:17
> >> To: n...@fisk.me.uk
> >> Cc: ceph-users 
> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
> >>
> >> Hi,
> >>
> >>
> >> Looks like one of my colleagues increased the PG number before it
> >> finished. I was flushing the whole cache tier and it's currently
> >> stuck on ~80 GB of data, because of the OSD crashes. I will look into the 
> >> hitset counts and check what can be done. Will provide an
> update if I find anything or fix the issue.
> >
> > So I'm guessing when the PG split, the stats/hit_sets are not how the OSD 
> > is expecting them to be and causes the crash. I would
> expect this has been caused by the PG splitting rather than introducing extra 
> OSD's. If you manage to get things stable by bumping up
> the hitset count, then you probably want to try and do a scrub to try and 
> clean up the stats, which may then stop this happening when
> the hitset comes round to being trimmed again.
> >
> >>
> >>
> >> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk  wrote:
> >> > Hi Daznis,
> >> >
> >> > I'm not sure how much help I can be, but I will try my best.
> >> >
> >> > I think the post-split stats error is probably benign, although I
> >> > think this suggests you also increased the number of PG's in your
> >> > cache pool? If so did you do this before or after you added the
> >> extra OSD's?  This may have been the cause.
> >> >
> >> > On to the actual assert, this looks like it's part of the code
> >> > which trims the tiering hit set's. I don't understand why its
> >> > crashing out, but it must be related to an invalid or missing
> >> > hitset I would
> >> imagine.
> >> >
> >> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L
> >> > 104
> >> > 85
> >> >
> >> > The only thing I could think of from looking at in the code is that
> >> > the function loops through all hitsets that are above the max
> >> > number (hit_set_count). I wonder if setting this number higher
> >> > would
> >> mean it won't try and trim any hitsets and let things recover?
> >> >
> >> > DISCLAIMER
> >> > This is a hunch, it might not work or could possibly even make
> >> > things worse. Otherwise wait for someone who has a better idea to 
> >> > comment.
> >> >
> >> > Nick
> >> >
> >> >
> >> >
> >> >> -Original Message-
> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >> >> Behalf Of Daznis
> >> >> Sent: 23 November 2016 05:57
> >> >> To: ceph-users 
> >> >> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
> >> >>
> >> >> Hello,
> >> >>
> >> >>
> >> >> The story goes like this.
> >> >> I have added another 3 drives to the caching layer. OSDs were
> >> >> added to crush map one by one after each successful rebalance.
> >> >> When
> >> > I
> >> >> added the last OSD and went away for about an hour I noticed that
> >> >> it's still not finished rebalancing. Further investigation
> >> > showed me
> >> >> that it one of the older cache SSD was restarting like crazy
> >> >> before full boot. So I shut it down and waited for a rebalance
> >> > without that
> >> >> OSD. Less than an hour later I had another 2 OSD restarting like
> >> >> crazy. I tried running scrubs on the PG's logs asked me to, but
> >> > that did
> >> >> not help. I'm currently st

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread Nick Fisk
Hi Thomas,

 

I'm afraid I can't offer any more advice; there isn't anything that I can see
which could be the trigger. I know we spoke about downgrading the kernel; did
you manage to try that?

 

Nick

 

From: Thomas Danan [mailto:thomas.da...@mycom-osi.com] 
Sent: 23 November 2016 11:29
To: n...@fisk.me.uk; 'Peter Maloney' 
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

 

Hi all,

 

Still not able to find any explanation to this issue.

 

I recently tested the network and I am seeing some retransmits in the
output of iperf, but overall the bandwidth during a 10-second test is around 7 to
8 Gbps.

I was not sure whether it was the test itself that was overloading the
network or whether my network switches were having an issue.

The switches have been checked and they show no congestion issues or other
errors.

 

I really don't know what to check or test; any idea is more than welcome …

 

Thomas 

 

From: Thomas Danan 
Sent: vendredi 18 novembre 2016 17:12
To: 'n...@fisk.me.uk'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

 

Hi Nick,

 

Here are some logs. The system is in IST TZ and I have filtered the logs to get 
only 2 last hours during which we can observe the issue.

 

In that particular case, issue is illustrated with the following OSDs

 

Primary:

ID:607

PID:2962227

HOST:10.137.81.18

 

Secondary1

ID:528

PID:3721728

HOST:10.137.78.194

 

Secondary2

ID:771

PID:2806795

HOST:10.137.81.25

 

In that specific example, first slow request message is detected at 16:18

 

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs

2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

 

I see that it is about replicating a 4 MB object with a snapc context, but in my 
environment I have no snapshots (actually they were all deleted). Also, I was 
told those messages are not necessarily related to object replication to a 
snapshot image.

Every slow request message I get is formatted as described, with a 4 MB object 
and a snapc context.

 

Rados df is showing me that I have 4 cloned objects, I do not understand why.

 

15 minutes later the ops seem to be unblocked, after an "initiating reconnect" message

 

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771

2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 
c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)

2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227 >> 
192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647 cs=5 l=0 
c=0x42798c0).fault, initiating reconnect

 

I do not manage to identify anything obvious in the logs.
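
A way to dig further into a blocked request like the one above is to dump the 
in-flight and recent ops on the OSDs it names, via their admin sockets, e.g. 
(OSD id 607 taken from the log above, run on that OSD's host):

ceph daemon osd.607 dump_ops_in_flight
ceph daemon osd.607 dump_historic_ops

The event timestamps in that output usually show whether an op is stuck waiting 
for the journal, the disk, or the replica OSDs.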

 

Thanks for your help …

 

Thomas

 

 

From: Nick Fisk [mailto:n...@fisk.me.uk] 
Sent: jeudi 17 novembre 2016 11:02
To: Thomas Danan; n...@fisk.me.uk  ; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com  
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

 

Hi Thomas,

 

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

 

Do you also see anything about OSD’s going down in the Mon ceph.log file around 
that time?

 

480 seconds is probably far too long for a disk to be busy, so I’m wondering 
if the OSD is either dying and respawning, or if you are running out of some 
type of system resource, e.g. TCP connections or something like that, which means 
the OSDs can’t communicate with each other.
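
If it is a resource issue, something along these lines (illustrative only), run 
on an OSD host while requests are blocked, might show it; pidof -s just picks 
one ceph-osd process, so repeat per OSD as needed:

ss -s                                           # overall TCP socket summary
ls /proc/$(pidof -s ceph-osd)/fd | wc -l        # open fds of one ceph-osd
grep 'open files' /proc/$(pidof -s ceph-osd)/limits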

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 17 November 2016 08:59
To: n...@fisk.me.uk  ; 'Peter Maloney' 
mailto:peter.malo...@brockmann-consult.de> 
>
Cc: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

 

Hi,

 

I have recheck the pattern when slow request are detected.

 

I have an example with following 

Re: [ceph-users] how possible is that ceph cluster crash

2016-11-23 Thread Nick Fisk
Hi Sam,

Would a check in ceph-disk for "nobarrier" in the osd_mount_options_{fstype} 
variable be a good idea? It could either strip it out or
fail to start the OSD unless an override flag is specified somewhere.

Looking at ceph-disk code, I would imagine around here would be the right place 
to put the check
https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L2642

I don't mind trying to get this done if it's felt to be worthwhile.

Nick
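
As a rough sketch of the idea (not actual ceph-disk code; the option name is 
per-filesystem, e.g. osd_mount_options_xfs, and this assumes ceph-conf is 
available on the node):

if ceph-conf --name osd.0 --lookup osd_mount_options_xfs | grep -qw nobarrier; then
    echo "refusing to start OSD: nobarrier found in osd_mount_options_xfs" >&2
    exit 1
fi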

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Samuel Just
> Sent: 19 November 2016 00:31
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] how possible is that ceph cluster crash
> 
> Many reasons:
> 
> 1) You will eventually get a DC wide power event anyway at which point 
> probably most of the OSDs will have hopelessly corrupted
> internal xfs structures (yes, I have seen this happen to a poor soul with a 
> DC with redundant power).
> 2) Even in the case of a single rack/node power failure, the biggest danger 
> isn't that the OSDs don't start.  It's that they *do
start*, but
> forgot or arbitrarily corrupted a random subset of transactions they told 
> other osds and clients that they committed.  The exact
impact
> would be random, but for sure, any guarantees Ceph normally provides would be 
> out the window.  RBD devices could have random
> byte ranges zapped back in time (not great if they're the offsets assigned to 
> your database or fs journal...) for instance.
> 3) Deliberately powercycling a node counts as a power failure if you don't 
> stop services and sync etc first.
> 
> In other words, don't mess with the definition of "committing a transaction" 
> if you value your data.
> -Sam "just say no" Just
> 
> On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk  wrote:
> > Yes, because these things happen
> >
> > http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_inter
> > ruption/
> >
> > We had customers who had kit in this DC.
> >
> > To use your analogy, it's like crossing the road at traffic lights but
> > not checking cars have stopped. You might be OK 99% of the time, but
> > sooner or later it will bite you in the arse and it won't be pretty.
> >
> > 
> > From: "Brian ::" 
> > Sent: 18 Nov 2016 11:52 p.m.
> > To: sj...@redhat.com
> > Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
> > Subject: Re: [ceph-users] how possible is that ceph cluster crash
> >
> >>
> >> This is like your mother telling not to cross the road when you were
> >> 4 years of age but not telling you it was because you could be
> >> flattened by a car :)
> >>
> >> Can you expand on your answer? If you are in a DC with AB power,
> >> redundant UPS, dual feed from the electric company, onsite
> >> generators, dual PSU servers, is it still a bad idea?
> >>
> >>
> >>
> >>
> >> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just  wrote:
> >>>
> >>> Never *ever* use nobarrier with ceph under *any* circumstances.  I
> >>> cannot stress this enough.
> >>> -Sam
> >>>
> >>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi 
> >>> wrote:
> 
>  Hi Nick and other Cephers,
> 
>  Thanks for your reply.
> 
> > 2) Config Errors
> > This can be an easy one to say you are safe from. But I would say
> > most outages and data loss incidents I have seen on the mailing
> > lists have been due to poor hardware choice or configuring options
> > such as size=2, min_size=1 or enabling stuff like nobarriers.
> 
> 
>  I am wondering the pros and cons of the nobarrier option used by Ceph.
> 
>  It is well known that nobarrier is dangerous when power outage
>  happens, but if we already have replicas in different racks or
>  PDUs, will Ceph reduce the risk of data lost with this option?
> 
>  I have seen many performance tuning articles providing the nobarrier
>  option in xfs, but not many of them mention the trade-off
>  of nobarrier.
> 
>  Is it really unacceptable to use nobarrier in a production
>  environment? I would be much grateful if you guys are willing to
>  share any experiences about nobarrier and xfs.
> 
>  Sincerely,
>  Craig Chi (Product Developer)
>  Synology Inc. Taipei, Taiwan. Ext. 361
> 
>  On 2016-11-17 05:04, Nick Fisk  wrote:
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > Behalf Of Pedro Benites
> > Sent: 16 November 2016 17:51
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] how possible is that ceph cluster crash
> >
> > Hi,
> >
> > I have a ceph cluster with 5

Re: [ceph-users] Ceph strange issue after adding a cache OSD.

2016-11-23 Thread Daznis
No, it's still missing some PGs and objects and can't recover as it's
blocked by that OSD. I can boot the OSD up by removing all the
PG-related files from the current directory, but that doesn't solve the
missing objects problem. Not really sure if I can move the objects back
to their place manually, but I will try it.
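
For reference, the "bumping up the hitset count" idea Nick mentions further 
down is just a pool setting, e.g. (cache pool name is a placeholder, and per 
his disclaimer it may not help):

ceph osd pool get <cache-pool> hit_set_count
ceph osd pool set <cache-pool> hit_set_count 16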

On Wed, Nov 23, 2016 at 3:08 PM, Nick Fisk  wrote:
> Sorry, I'm afraid I'm out of ideas about that one, that error doesn't mean 
> very much to me. The code suggests the OSD is trying to get an attr from the 
> disk/filesystem, but for some reason it doesn't like that. You could maybe 
> whack the debug logging for OSD and filestore up to max and try and see what 
> PG/file is accessed just before the crash, but I'm not sure what the fix 
> would be, even if you manage to locate the dodgy PG.
>
> Does the cluster have all PG's recovered now? Unless anyone else can comment, 
> you might be best removing/wiping and then re-adding the OSD.
>
>> -Original Message-
>> From: Daznis [mailto:daz...@gmail.com]
>> Sent: 23 November 2016 12:55
>> To: Nick Fisk 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
>>
>> Thank you. That helped quite a lot. Now I'm just stuck with one OSD crashing 
>> with:
>>
>> osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, 
>> epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time
>> 2016-11-23 13:42:43.27
>> 8539
>> osd/PG.cc: 2911: FAILED assert(r > 0)
>>
>>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0xbde2c5]
>>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
>> ceph::buffer::list*)+0x8ba) [0x7cf4da]
>>  3: (OSD::load_pgs()+0x9ef) [0x6bd31f]
>>  4: (OSD::init()+0x181a) [0x6c0e8a]
>>  5: (main()+0x29dd) [0x6484bd]
>>  6: (__libc_start_main()+0xf5) [0x7f36b916bb15]
>>  7: /usr/bin/ceph-osd() [0x661ea9]
>>
>> On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk  wrote:
>> >> -Original Message-
>> >> From: Daznis [mailto:daz...@gmail.com]
>> >> Sent: 23 November 2016 10:17
>> >> To: n...@fisk.me.uk
>> >> Cc: ceph-users 
>> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
>> >>
>> >> Hi,
>> >>
>> >>
>> >> Looks like one of my colleagues increased the PG number before it
>> >> finished. I was flushing the whole cache tier and it's currently
>> >> stuck on ~80 GB of data, because of the OSD crashes. I will look into the 
>> >> hitset counts and check what can be done. Will provide an
>> update if I find anything or fix the issue.
>> >
>> > So I'm guessing when the PG split, the stats/hit_sets are not how the OSD 
>> > is expecting them to be and causes the crash. I would
>> expect this has been caused by the PG splitting rather than introducing 
>> extra OSD's. If you manage to get things stable by bumping up
>> the hitset count, then you probably want to try and do a scrub to try and 
>> clean up the stats, which may then stop this happening when
>> the hitset comes round to being trimmed again.
>> >
>> >>
>> >>
>> >> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk  wrote:
>> >> > Hi Daznis,
>> >> >
>> >> > I'm not sure how much help I can be, but I will try my best.
>> >> >
>> >> > I think the post-split stats error is probably benign, although I
>> >> > think this suggests you also increased the number of PG's in your
>> >> > cache pool? If so did you do this before or after you added the
>> >> extra OSD's?  This may have been the cause.
>> >> >
>> >> > On to the actual assert, this looks like it's part of the code
>> >> > which trims the tiering hit set's. I don't understand why its
>> >> > crashing out, but it must be related to an invalid or missing
>> >> > hitset I would
>> >> imagine.
>> >> >
>> >> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L
>> >> > 104
>> >> > 85
>> >> >
>> >> > The only thing I could think of from looking at in the code is that
>> >> > the function loops through all hitsets that are above the max
>> >> > number (hit_set_count). I wonder if setting this number higher
>> >> > would
>> >> mean it won't try and trim any hitsets and let things recover?
>> >> >
>> >> > DISCLAIMER
>> >> > This is a hunch, it might not work or could possibly even make
>> >> > things worse. Otherwise wait for someone who has a better idea to 
>> >> > comment.
>> >> >
>> >> > Nick
>> >> >
>> >> >
>> >> >
>> >> >> -Original Message-
>> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> >> >> Behalf Of Daznis
>> >> >> Sent: 23 November 2016 05:57
>> >> >> To: ceph-users 
>> >> >> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >>
>> >> >> The story goes like this.
>> >> >> I have added another 3 drives to the caching layer. OSDs were
>> >> >> added to crush map one by one after each successful rebalance.
>> >> >> When
>> >> > I
>> >> >> added the last OSD

Re: [ceph-users] export-diff behavior if an initial snapshot is NOT specified

2016-11-23 Thread Jason Dillaman
What you are seeing sounds like a side-effect of deep-flatten support.
If you write to an unallocated extent within a cloned image, the
associated object extent must be read from the parent image, modified,
and written to the clone image.

Since the Infernalis release, this process has been tweaked if the
cloned image has a snapshot. In that case, the associated object
extent is still read from the parent, but instead of being modified
and written to the HEAD revision, it is left unmodified and is written
to "pre" snapshot history followed by writing the original
modification (w/o the parent's object extent data) to the HEAD
revision.

This change to the IO path was made to support flattening clones and
dissociating them from their parents even if the clone had snapshots.

Therefore, what you are seeing with export-diff is actually the
backing object extent of data from the parent image written to the
clone's "pre" snapshot history. If you had two snapshots and your
export-diff'ed from the first to second snapshot, you wouldn't see
this extra data.

To your question about how to prepare image B to make sure it will be
exactly the same, the answer is that you don't need to do anything. In
your example above, I am assuming you are manually creating an empty
Image B and using "import-diff" to populate it. The difference in the
export-diff is most likely related to fact that the clone lost its
sparseness on any backing object that was written (e.g. instead of a
one or more 512 byte diffs within a backing object extent, you will
see a single, full-object extent with zeroes where the parent image
had no data).


On Wed, Nov 23, 2016 at 5:06 AM, Zhongyan Gu  wrote:
> Let me make the issue more clear.
> Suppose I cloned image A from a parent image and create snap1 for image A
> and  then make some change of image A.
> If I did the rbd export-diff @snap1, how should I prepare the existing image
> B to make sure it will be exactly the same as image A@snap1 after import-diff
> against this image B?
>
> Thanks,
> Zhongyan
>
>
> On Wed, Nov 23, 2016 at 11:34 AM, Zhongyan Gu  wrote:
>>
>> Thanks Jason, very clear explanation.
>> However, I found some strange behavior when export-diff on a cloned image,
>> not sure it is a bug on calc_snap_set_diff().
>> The test is,
>> Image A is cloned from a parent image, then snap1 is created for image A.
>> The content of export-diff A@snap1 changes when image A is updated.
>> Only after image A has no overlap with its parent does the content of
>> export-diff A@snap1 become stable, which is almost zero.
>> I don't think this is designed behavior; export-diff A@snap1 should always
>> give a stable output no matter whether image A is cloned or not.
>>
>> Please correct me if anything wrong.
>>
>> Thanks,
>> Zhongyan
>>
>>
>>
>>
>> On Tue, Nov 22, 2016 at 10:31 PM, Jason Dillaman 
>> wrote:
>>>
>>> On Tue, Nov 22, 2016 at 5:31 AM, Zhongyan Gu 
>>> wrote:
>>> > So if initial snapshot is NOT specified, then:
>>> > rbd export-diff image@snap1 will diff all data to snap1. this cmd
>>> > equals to
>>> > :
>>> > rbd export image@snap1. Is my understand right or not??
>>>
>>>
>>> While they will both export all data associated w/ image@snap1, the
>>> "export" command will generate a raw, non-sparse dump of the full
>>> image whereas "export-diff" will export only sections of the image
>>> that contain data. The file generated from "export" can be used with
>>> the "import" command to create a new image, whereas the file generated
>>> from "export-diff" can only be used with "import-diff" against an
>>> existing image.
>>>
>>> --
>>> Jason
>>
>>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] deep-scrubbing has large impact on performance

2016-11-23 Thread Nick Fisk
Actually this might suggest that caution should be taken before enabling this 
at the moment

http://tracker.ceph.com/issues/15774
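
For anyone who still wants to try the wpq combination discussed below, it is 
just two OSD options; as far as I know both need an OSD restart rather than 
injectargs:

[osd]
osd_op_queue = wpq
osd_op_queue_cut_off = high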


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick 
> Fisk
> Sent: 23 November 2016 11:17
> To: 'Robert LeBlanc' ; 'Eugen Block' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] deep-scrubbing has large impact on performance
> 
> Thanks for the tip Robert, much appreciated.
> 
> > -Original Message-
> > From: Robert LeBlanc [mailto:rob...@leblancnet.us]
> > Sent: 23 November 2016 00:54
> > To: Eugen Block 
> > Cc: Nick Fisk ; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] deep-scrubbing has large impact on
> > performance
> >
> > If you use wpq, I recommend also setting "osd_op_queue_cut_off = high"
> > as well, otherwise replication ops are not weighted, which really reduces the 
> > benefit of wpq.
> > 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Tue, Nov 22, 2016 at 5:34 AM, Eugen Block  wrote:
> > > Thank you!
> > >
> > >
> > > Zitat von Nick Fisk :
> > >
> > >>> -Original Message-
> > >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > >>> Behalf Of Eugen Block
> > >>> Sent: 22 November 2016 10:11
> > >>> To: Nick Fisk 
> > >>> Cc: ceph-users@lists.ceph.com
> > >>> Subject: Re: [ceph-users] deep-scrubbing has large impact on
> > >>> performance
> > >>>
> > >>> Thanks for the very quick answer!
> > >>>
> > >>> > If you are using Jewel
> > >>>
> > >>> We are still using Hammer (0.94.7), we wanted to upgrade to Jewel
> > >>> in a couple of weeks, would you recommend to do it now?
> > >>
> > >>
> > >> It's been fairly solid for me, but you might want to wait for the
> > >> scrubbing hang bug to be fixed before upgrading. I think this might
> > >> be fixed in the upcoming 10.2.4 release.
> > >>
> > >>>
> > >>>
> > >>> Zitat von Nick Fisk :
> > >>>
> > >>> >> -Original Message-
> > >>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > >>> >> Behalf Of Eugen Block
> > >>> >> Sent: 22 November 2016 09:55
> > >>> >> To: ceph-users@lists.ceph.com
> > >>> >> Subject: [ceph-users] deep-scrubbing has large impact on
> > >>> >> performance
> > >>> >>
> > >>> >> Hi list,
> > >>> >>
> > >>> >> I've been searching the mail archive and the web for some help.
> > >>> >> I tried the things I found, but I can't see the effects. We use
> > >>> > Ceph for
> > >>> >> our Openstack environment.
> > >>> >>
> > >>> >> When our cluster (2 pools, each 4092 PGs, in 20 OSDs on 4
> > >>> >> nodes,
> > >>> >> 3
> > >>> >> MONs) starts deep-scrubbing, it's impossible to work with the VMs.
> > >>> >> Currently, the deep-scrubs happen to start on Monday, which is
> > >>> >> unfortunate. I already plan to start the next deep-scrub on
> > >>> > Saturday,
> > >>> >> so it has no impact on our work days. But if I imagine we had a
> > >>> >> large multi-datacenter, such performance breaks are not
> > >>> > reasonable. So
> > >>> >> I'm wondering how do you guys manage that?
> > >>> >>
> > >>> >> What I've tried so far:
> > >>> >>
> > >>> >> ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
> > >>> >> ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
> > >>> >> ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
> > >>> >> ceph tell osd.* injectargs '--osd_scrub_begin_hour 0'
> > >>> >> ceph tell osd.* injectargs '--osd_scrub_end_hour 7'
> > >>> >>
> > >>> >> And I also added these options to the ceph.conf.
> > >>> >> To be able to work again, I had to set the nodeep-scrub option
> > >>> >> and unset it when I left the office. Today, I see the cluster
> > >>> >> deep- scrubbing again, but only one PG at a time, it seems that
> > >>> >> now the default for osd_max_scrubs is working now and I don't
> > >>> >> see major impacts yet.
> > >>> >>
> > >>> >> But is there something else I can do to reduce the performance 
> > >>> >> impact?
> > >>> >
> > >>> > If you are using Jewel, the scrubing is now done in the client
> > >>> > IO thread, so those disk thread options won't do anything.
> > >>> > Instead there is a new priority setting, which seems to work for
> > >>> > me, along with a few other settings.
> > >>> >
> > >>> > osd_scrub_priority = 1
> > >>> > osd_scrub_sleep = .1
> > >>> > osd_scrub_chunk_min = 1
> > >>> > osd_scrub_chunk_max = 5
> > >>> > osd_scrub_load_threshold = 5
> > >>> >
> > >>> > Also enabling the weighted priority queue can assist the new
> > >>> > priority options
> > >>> >
> > >>> > osd_op_queue = wpq
> > >>> >
> > >>> >
> > >>> >> I just found [1] and will have a look into it.
> > >>> >>
> > >>> >> [1] http://prob6.com/en/ceph-pg-deep-scrub-cron/
> > >>> >>
> > >>> >> Thanks!
> > >>> >> Eugen
> > >>> >>
> > >>> >> --
> > >>> >> Eugen Block voice   : +49-40-559 51 75
> > >>> >> NDE Netzdesign und -entwicklung AG  fax  

Re: [ceph-users] Ceph strange issue after adding a cache OSD.

2016-11-23 Thread Nick Fisk
I take it you have size=2 or min_size=1 or something like that for the cache 
pool? 1 OSD shouldn’t prevent PGs from recovering.

Your best bet would be to see if the PG that is causing the assert can be 
removed and let the OSD start up. If you are lucky, the PG causing the problems 
might not be one which also has unfound objects; otherwise you will likely have 
to get heavily involved in recovering objects with the objectstore tool.
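
Roughly, that would look like the following against the stopped OSD 
(placeholders for the OSD id and PG id; exporting first means nothing is lost 
for good):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
    --op export --pgid <pgid> --file /root/<pgid>.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
    --op remove --pgid <pgid>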

> -Original Message-
> From: Daznis [mailto:daz...@gmail.com]
> Sent: 23 November 2016 13:56
> To: Nick Fisk 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
> 
> No, it's still missing some PGs and objects and can't recover as it's blocked 
> by that OSD. I can boot the OSD up by removing all the PG
> related files from current directory, but that doesn't solve the missing 
> objects problem. Not really sure if I can move the object back to
> their place manually, but I will try it.
> 
> On Wed, Nov 23, 2016 at 3:08 PM, Nick Fisk  wrote:
> > Sorry, I'm afraid I'm out of ideas about that one, that error doesn't mean 
> > very much to me. The code suggests the OSD is trying to
> get an attr from the disk/filesystem, but for some reason it doesn't like 
> that. You could maybe whack the debug logging for OSD and
> filestore up to max and try and see what PG/file is accessed just before the 
> crash, but I'm not sure what the fix would be, even if you
> manage to locate the dodgy PG.
> >
> > Does the cluster have all PG's recovered now? Unless anyone else can 
> > comment, you might be best removing/wiping and then re-
> adding the OSD.
> >
> >> -Original Message-
> >> From: Daznis [mailto:daz...@gmail.com]
> >> Sent: 23 November 2016 12:55
> >> To: Nick Fisk 
> >> Cc: ceph-users 
> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
> >>
> >> Thank you. That helped quite a lot. Now I'm just stuck with one OSD 
> >> crashing with:
> >>
> >> osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,
> >> spg_t, epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time
> >> 2016-11-23 13:42:43.27
> >> 8539
> >> osd/PG.cc: 2911: FAILED assert(r > 0)
> >>
> >>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >> const*)+0x85) [0xbde2c5]
> >>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
> >> ceph::buffer::list*)+0x8ba) [0x7cf4da]
> >>  3: (OSD::load_pgs()+0x9ef) [0x6bd31f]
> >>  4: (OSD::init()+0x181a) [0x6c0e8a]
> >>  5: (main()+0x29dd) [0x6484bd]
> >>  6: (__libc_start_main()+0xf5) [0x7f36b916bb15]
> >>  7: /usr/bin/ceph-osd() [0x661ea9]
> >>
> >> On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk  wrote:
> >> >> -Original Message-
> >> >> From: Daznis [mailto:daz...@gmail.com]
> >> >> Sent: 23 November 2016 10:17
> >> >> To: n...@fisk.me.uk
> >> >> Cc: ceph-users 
> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
> >> >>
> >> >> Hi,
> >> >>
> >> >>
> >> >> Looks like one of my colleagues increased the PG number before it
> >> >> finished. I was flushing the whole cache tier and it's currently
> >> >> stuck on ~80 GB of data, because of the OSD crashes. I will look
> >> >> into the hitset counts and check what can be done. Will provide an
> >> update if I find anything or fix the issue.
> >> >
> >> > So I'm guessing when the PG split, the stats/hit_sets are not how
> >> > the OSD is expecting them to be and causes the crash. I would
> >> expect this has been caused by the PG splitting rather than
> >> introducing extra OSD's. If you manage to get things stable by
> >> bumping up the hitset count, then you probably want to try and do a scrub 
> >> to try and clean up the stats, which may then stop this
> happening when the hitset comes round to being trimmed again.
> >> >
> >> >>
> >> >>
> >> >> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk  wrote:
> >> >> > Hi Daznis,
> >> >> >
> >> >> > I'm not sure how much help I can be, but I will try my best.
> >> >> >
> >> >> > I think the post-split stats error is probably benign, although
> >> >> > I think this suggests you also increased the number of PG's in
> >> >> > your cache pool? If so did you do this before or after you added
> >> >> > the
> >> >> extra OSD's?  This may have been the cause.
> >> >> >
> >> >> > On to the actual assert, this looks like it's part of the code
> >> >> > which trims the tiering hit set's. I don't understand why its
> >> >> > crashing out, but it must be related to an invalid or missing
> >> >> > hitset I would
> >> >> imagine.
> >> >> >
> >> >> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.c
> >> >> > c#L
> >> >> > 104
> >> >> > 85
> >> >> >
> >> >> > The only thing I could think of from looking at in the code is
> >> >> > that the function loops through all hitsets that are above the
> >> >> > max number (hit_set_count). I wonder if setting this number
> >> >> > higher

Re: [ceph-users] KVM / Ceph performance problems

2016-11-23 Thread M. Piscaer
After changing the settings below, the Linux guest has a good write speed.

But the FreeNAS guest still stays at 10 MB/s.

After doing some tests on FreeBSD with a bigger block size:
"dd if=/dev/zero of=testfile bs=9000" I get about 80 MB/s.

With "dd if=/dev/zero of=testfile" the speed is 10 MB/s.

What can I do?

Kind regards,

Michiel Piscaer




On 23-11-16 10:02, M. Piscaer wrote:
> Hi,
> 
> Thank you for your help.
> 
> After changing these settings the linux guest got en increase in speed.
> The FreeNAS guest still has an write speed of 10MB/s.
> 
> The disk driver is virtio and has en Write back cache.
> 
> What am I missing?
> 
> Kinds regards,
> 
> Michiel Piscaer
> 
> On 23-11-16 08:05, Оралов, Алексей С. wrote:
>> hello, Michiel
>>
>> Use hdd driver "virtio" and cache "Write back".
>>
>> And also on proxmox node add ceph client configuration:
>>
>> /etc/ceph/ceph.conf
>>
>> [client]
>> #admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>> rbd cache = true
>> rbd_cache_writethrough_until_flush = true
>> rbd_readahead_disable_after_bytes=0
>> rbd_default_format = 2
>>
>> #Tuning options
>> #rbd_cache_size = 67108864  #64M
>> #rbd_cache_max_dirty = 50331648  #48M
>> #rbd_cache_target_dirty = 33554432  #32M
>> #rbd_cache_max_dirty_age = 2
>> #rbd_op_threads = 10
>> #rbd_readahead_trigger_requests = 10
>>
>>
>> 23.11.2016 9:53, M. Piscaer пишет:
>>> Hi,
>>>
>>> I have an little performance problem with KVM and Ceph.
>>>
>>> I'm using Proxmox 4.3-10/7230e60f, with KVM version
>>> pve-qemu-kvm_2.7.0-8. Ceph is on version jewel 10.2.3 on both the
>>> cluster as the client (ceph-common).
>>>
>>> The systems are connected to the network via an 4x bonding with an total
>>> of 4 Gb/s.
>>>
>>> Within an guest,
>>> - when I do an write to I get about 10 MB/s.
>>> - Also when I try to do an write within the guest but then directly to
>>> ceph I get the same speed.
>>> - But when I mount an ceph object on the Proxmox host I get about 110MB/s
>>>
>>> The guest is connected to interface vmbr160 → bond0.160 → bond0.
>>>
>>> This bridge vmbr160 has an IP address with the same subnet as the ceph
>>> cluster with an mtu 9000.
>>>
>>> The KVM block device is an virtio device.
>>>
>>> What can I do to solve this problem?
>>>
>>> Kind regards,
>>>
>>> Michiel Piscaer
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>

-- 

E-mail:   mich...@digidiensten.nl
Telefoon: +31 77 7501700
Fax:  +31 77 7501701
Mobiel:   +31 6 16048782
Threema:  PBPCM9X3
PGP:  0x09F8706A
W3:   www.digidiensten.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph in an OSPF environment

2016-11-23 Thread Ansgar Jazdzewski
Hi,

we are planning to build a new datacenter with a layer-3 routed network
under all our servers.
So each server will have a 172.16.x.y/32 IP which is announced and shared
using OSPF with ECMP.

Now I am trying to install Ceph on these nodes and I got stuck because the
OSD nodes cannot reach the MON (ceph -s gives no output).

Ping is OK and telnet to the mon port is also working.


default  proto bird  src 172.16.162.10
   nexthop via 192.168.1.1  dev eno1 weight 1
   nexthop via 192.168.2.1  dev eno2 weight 1
172.16.162.1  proto bird  src 172.16.162.10
   nexthop via 192.168.1.1  dev eno1 weight 1
   nexthop via 192.168.2.1  dev eno2 weight 1
172.16.162.11  proto bird  src 172.16.162.10
   nexthop via 192.168.1.11  dev eno1 weight 1
   nexthop via 192.168.2.11  dev eno2 weight 1
172.16.162.12  proto bird  src 172.16.162.10
   nexthop via 192.168.1.12  dev eno1 weight 1
   nexthop via 192.168.2.12  dev eno2 weight 1
172.16.162.13  proto bird  src 172.16.162.10
   nexthop via 192.168.1.13  dev eno1 weight 1
   nexthop via 192.168.2.13  dev eno2 weight 1



root@node-0001a:~/ceph/eu-fra# systemctl restart ceph-mon@node-0001a
root@node-0001a:~/ceph/eu-fra# netstat -ntpl | grep ceph-mon
tcp0  0 172.16.162.10:6789  0.0.0.0:*
LISTEN  6841/ceph-mon
#

So I am not sure what the root cause of this behavior is.

any ideas?

thanks,
Ansgar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how possible is that ceph cluster crash

2016-11-23 Thread Samuel Just
Seems like that would be helpful.  I'm not really familiar with
ceph-disk though.
-Sam

On Wed, Nov 23, 2016 at 5:24 AM, Nick Fisk  wrote:
> Hi Sam,
>
> Would a check in ceph-disk for "nobarrier" in the osd_mount_options_{fstype} 
> variable be a good idea? It could either strip it out or
> fail to start the OSD unless an override flag is specified somewhere.
>
> Looking at ceph-disk code, I would imagine around here would be the right 
> place to put the check
> https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L2642
>
> I don't mind trying to get this done if it's felt to be worthwhile.
>
> Nick
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Samuel Just
>> Sent: 19 November 2016 00:31
>> To: Nick Fisk 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] how possible is that ceph cluster crash
>>
>> Many reasons:
>>
>> 1) You will eventually get a DC wide power event anyway at which point 
>> probably most of the OSDs will have hopelessly corrupted
>> internal xfs structures (yes, I have seen this happen to a poor soul with a 
>> DC with redundant power).
>> 2) Even in the case of a single rack/node power failure, the biggest danger 
>> isn't that the OSDs don't start.  It's that they *do
> start*, but
>> forgot or arbitrarily corrupted a random subset of transactions they told 
>> other osds and clients that they committed.  The exact
> impact
>> would be random, but for sure, any guarantees Ceph normally provides would 
>> be out the window.  RBD devices could have random
>> byte ranges zapped back in time (not great if they're the offsets assigned 
>> to your database or fs journal...) for instance.
>> 3) Deliberately powercycling a node counts as a power failure if you don't 
>> stop services and sync etc first.
>>
>> In other words, don't mess with the definition of "committing a transaction" 
>> if you value your data.
>> -Sam "just say no" Just
>>
>> On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk  wrote:
>> > Yes, because these things happen
>> >
>> > http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_inter
>> > ruption/
>> >
>> > We had customers who had kit in this DC.
>> >
>> > To use your analogy, it's like crossing the road at traffic lights but
>> > not checking cars have stopped. You might be OK 99% of the time, but
>> > sooner or later it will bite you in the arse and it won't be pretty.
>> >
>> > 
>> > From: "Brian ::" 
>> > Sent: 18 Nov 2016 11:52 p.m.
>> > To: sj...@redhat.com
>> > Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
>> > Subject: Re: [ceph-users] how possible is that ceph cluster crash
>> >
>> >>
>> >> This is like your mother telling not to cross the road when you were
>> >> 4 years of age but not telling you it was because you could be
>> >> flattened by a car :)
>> >>
>> >> Can you expand on your answer? If you are in a DC with AB power,
>> >> redundant UPS, dual feed from the electric company, onsite
>> >> generators, dual PSU servers, is it still a bad idea?
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just  wrote:
>> >>>
>> >>> Never *ever* use nobarrier with ceph under *any* circumstances.  I
>> >>> cannot stress this enough.
>> >>> -Sam
>> >>>
>> >>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi 
>> >>> wrote:
>> 
>>  Hi Nick and other Cephers,
>> 
>>  Thanks for your reply.
>> 
>> > 2) Config Errors
>> > This can be an easy one to say you are safe from. But I would say
>> > most outages and data loss incidents I have seen on the mailing
>> > lists have been due to poor hardware choice or configuring options
>> > such as size=2, min_size=1 or enabling stuff like nobarriers.
>> 
>> 
>>  I am wondering the pros and cons of the nobarrier option used by Ceph.
>> 
>>  It is well known that nobarrier is dangerous when power outage
>>  happens, but if we already have replicas in different racks or
>>  PDUs, will Ceph reduce the risk of data lost with this option?
>> 
>>  I have seen many performance tuning articles providing the nobarrier
>>  option in xfs, but not many of them mention the trade-off
>>  of nobarrier.
>> 
>>  Is it really unacceptable to use nobarrier in a production
>>  environment? I would be much grateful if you guys are willing to
>>  share any experiences about nobarrier and xfs.
>> 
>>  Sincerely,
>>  Craig Chi (Product Developer)
>>  Synology Inc. Taipei, Taiwan. Ext. 361
>> 
>>  On 2016-11-17 05:04, Nick Fisk  wrote:
>> 
>> > -Original Message-
>> > From: ceph-users [mailto:ceph-use

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread David Turner
This thread has gotten quite large and I haven't read most of it, so I 
apologize if this is a duplicate idea/suggestion.  100% of the time that our 
cluster has blocked requests and we aren't increasing pg_num, adding storage, 
or having a disk fail... it is PG subfolder splitting.  100% of the time, every 
time, this is our cause of blocked requests.  It is often accompanied by drives 
being marked down by the cluster even though the osd daemon is still running.

The settings that govern this are filestore merge threshold and filestore split 
multiple 
(http://docs.ceph.com/docs/giant/rados/configuration/filestore-config-ref/).
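
For reference, these live in the [osd] section of ceph.conf; the values below 
are just an example, and if I remember the formula right a subfolder splits at 
roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects:

[osd]
filestore merge threshold = 40
filestore split multiple = 8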



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Nick Fisk 
[n...@fisk.me.uk]
Sent: Wednesday, November 23, 2016 6:09 AM
To: 'Thomas Danan'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

I’m afraid I can’t offer anymore advice, they isn’t anything that I can see 
which could be the trigger. I know we spoke about downgrading the kernel, did 
you manage to try that?

Nick

From: Thomas Danan [mailto:thomas.da...@mycom-osi.com]
Sent: 23 November 2016 11:29
To: n...@fisk.me.uk; 'Peter Maloney' 
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi all,

Still not able to find any explanation to this issue.

I recently tested the network and I am seeing some retransmit being done in the 
output of iperf. But overall the bandwidth durin a test of 10sec is around 7 to 
8 Gbps.
I was not sure to understand if it was the test itself who was overloading the 
network or if my network switches that were having an issue.
Switches have been checked and they are showing no congestion issues or other 
errors.

I really don’t know what to check or test, any idea is more than welcomed …

Thomas

From: Thomas Danan
Sent: vendredi 18 novembre 2016 17:12
To: 'n...@fisk.me.uk'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Here are some logs. The system is in IST TZ and I have filtered the logs to get 
only 2 last hours during which we can observe the issue.

In that particular case, issue is illustrated with the following OSDs

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, first slow request message is detected at 16:18

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB Object with snapc context but in my 
environment I have no snapshot (actually they were all deleted). Also I was 
said those messages were not necessary related to object replication to 
snapshot image.
Each time I have a slow request message it is formatted as described with 4MB 
Object and snapc context

Rados df is showing me that I have 4 cloned objects, I do not understand why.

15 minutes later seems ops are unblocked after initiating reconnect message

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 
c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227 >> 
192.168.228.28:68

Re: [ceph-users] cephfs (rbd) read performance low - where is thebottleneck?

2016-11-23 Thread Mike Miller

JiaJia, all,

thanks, yes, I have the mount opts in mtab, and correct: if I leave out 
the "-v" option, there are no complaints.


mtab:
mounted ... type ceph (name=cephfs,rasize=134217728,key=client.cephfs)

It has to be rasize (rsize will not work).
One can check here:

cat /sys/class/bdi/ceph-*/read_ahead_kb
-> 131072

And YES! I am so happy, dd on a 40 GB file does a lot more single-threaded now, 
much better.


rasize= 67108864  222 MB/s
rasize=134217728  360 MB/s
rasize=268435456  474 MB/s
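
For the archive, a mount line along those lines looks roughly like this (the 
monitor host and mount point are placeholders, the secret file name follows the 
earlier example in this thread, and rasize is in bytes):

mount -t ceph <mon-host>:6789:/ /mnt/cephfs \
    -o name=cephfs,secretfile=secret.key,rasize=134217728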

Thank you all very much for bringing me on the right track, highly 
appreciated.


Regards,

Mike

On 11/23/16 5:55 PM, JiaJia Zhong wrote:

Mike,
if you run mount.ceph with the "-v" option, you may get "ceph: Unknown
mount option rsize";
actually, you can ignore this, as both rsize and rasize will be
passed to the mount syscall.

I believe that you have mounted cephfs successfully;
run "mount" in a terminal to check the actual mount opts in mtab.

-- Original --
*From: * "Mike Miller";
*Date: * Wed, Nov 23, 2016 02:38 PM
*To: * "Eric Eastman";
*Cc: * "Ceph Users";
*Subject: * Re: [ceph-users] cephfs (rbd) read performance low - where
is thebottleneck?

Hi,

did some testing with multithreaded access and dd; performance scales as it
should.

Any ideas to improve single-threaded read performance further would be
highly appreciated. Some of our use cases require that we read
large files with a single thread.

I have tried changing the readahead on the kernel client cephfs mount
too, rsize and rasize.

mount.ceph ... -o name=cephfs,secretfile=secret.key,rsize=67108864

Doing this on kernel 4.5.2 gives the error message:
"ceph: Unknown mount option rsize"
or unknown rasize.

Can someone explain to me how I can experiment with readahead on cephfs?

Mike

On 11/21/16 12:33 PM, Eric Eastman wrote:

Have you looked at your file layout?

On a test cluster running 10.2.3 I created a 5GB file and then looked
at the layout:

# ls -l test.dat
  -rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
  # file: test.dat
  ceph.file.layout="stripe_unit=4194304 stripe_count=1
object_size=4194304 pool=cephfs_data"

From what I understand with this layout you are reading 4MB of data
from 1 OSD at a time so I think you are seeing the overall speed of a
single SATA drive.  I do not think increasing your MON/MDS links to
10Gb will help, nor for a single file read will it help by going to
SSD for the metadata.

To test this, you may want to try creating 10 x 50GB files, and then
read them in parallel and see if your overall throughput increases.
If so, take a look at the layout parameters and see if you can change
the file layout to get more parallelization.

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst

Regards,
Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller wrote:

Hi,

reading a big file 50 GB (tried more too)

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks in 10 osd (6272 pgs, replication 3) gives
me only about *122 MB/s* read speed in single thread. Scrubbing turned off
during measurement.

I have been searching for possible bottlenecks. The network is not the
problem, the machine running dd is connected to the cluster public network
with a 20 GBASE-T bond. osd dual network: cluster public 10 GBASE-T,
private 10 GBASE-T.

The osd SATA disks are utilized only up until about 10% or 20%, not more
than that. CPUs on osd idle too. CPUs on mon idle, mds usage about 1.0 (1
core is used on this 6-core machine). mon and mds connected with only 1 GbE
(I would expect some latency from that, but no bandwidth issues; in fact
network bandwidth is about 20 Mbit max).

If I read a file with 50 GB, then clear the cache on the reading machine
(but not the osd caches), I get much better reading performance of about
*620 MB/s*. That seems logical to me as much (most) of the data is still in
the osd cache buffers. But still the read performance is not super
considered that the reading machine is connected to the cluster with a 20
Gbit/s bond.

How can I improve? I am not really sure, but from my understanding 2
possible bottlenecks come to mind:

1) 1 GbE connection to mon / mds

Is this the reason why reads are slow and osd disks are not hammered by read
requests and therewith fully utilized?

2) Move metadata to SSD

Currently, cephfs_metadata is on the same pool as the data on the spinning
SATA disks. Is this the bottleneck? Is the move of metadata to SSD a
solution?

Or is it both?

Your experience and insight are highly appreciated.

Thanks,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] cephfs mds failing to respond to capability release

2016-11-23 Thread Webert de Souza Lima
is it possible to count open file descriptors in cephfs only?
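
Not exactly file descriptors, but the per-client capability counts held by the 
active MDS come close and can be read from its admin socket, e.g. (MDS name is 
a placeholder):

ceph daemon mds.<name> session ls

Each session entry includes a "num_caps" field.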

On Wed, Nov 16, 2016 at 2:12 PM Webert de Souza Lima 
wrote:

> I'm sorry, by server, I meant cluster.
> On one cluster the rate of files created and read is about 5 per second.
> On another cluster it's from 25 to 30 files created and read per second.
>
> On Wed, Nov 16, 2016 at 2:03 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
> Hello John.
>
> I'm sorry for the lack of information at the first post.
> The same version is in use for servers and clients.
>
> About the workload, it varies.
> On one server it's about *5 files created/written and then fully read per
> second*.
> On the other server it's about *5 to 6 times that number*, so a lot more,
> but the problem does not escalate at the same proportion.
>
> *~# ceph -v*
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>
> *~#dpkg -l | grep ceph*
> ii  ceph-fuse10.2.2-1trusty
> amd64FUSE-based client for the Ceph distributed file system
>
> Some things are worth mentioning:
> The service(1) that creates the file sends an async request to another
> service(2) that reads it.
> The service(1) that creates the file also deletes it when its client
> closes the connection, so it can do so while the other service(2) is
> trying to read it. i'm not sure what would happen here.
>
>
>
> On Wed, Nov 16, 2016 at 1:42 PM John Spray  wrote:
>
> On Wed, Nov 16, 2016 at 3:15 PM, Webert de Souza Lima
>  wrote:
> > hi,
> >
> > I have many clusters running cephfs, and in the last 45 days or so, 2 of
> > them started giving me the following message in ceph health:
> > mds0: Client dc1-mx02-fe02:guest failing to respond to capability release
> >
> > When this happens, cephfs stops responding. It will only get back after I
> > restart the failing mds.
> >
> > Also, I get the following logs from ceph.log
> > https://paste.debian.net/896236/
> >
> > There was no change made that I can relate to this and I can't figure out
> > what is happening.
>
> I have the usual questions: what ceph versions, what clients etc
> (http://docs.ceph.com/docs/jewel/cephfs/early-adopters/#reporting-issues)
>
> Clients failing to respond to capability release are either buggy (old
> kernels?) or it's also possible that you have a workload that is
> holding an excessive number of files open.
>
> Cheers,
> John
>
>
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph in an OSPF environment

2016-11-23 Thread Darrell Enns
As far as I am aware, there is no broadcast or multicast traffic involved (at 
least, I don't see any on my cluster). So there should be no issue with routing 
it over layer 3. Have you checked the following:

- name resolution working on all hosts
- firewall/acl rules
- selinux
- tcpdump the mon traffic (port 6789) to see that it's getting through
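
For the last point, something as simple as this on the mon host (illustrative; 
adjust the interface as needed) shows whether the probes arrive and get 
answered:

tcpdump -ni any tcp port 6789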
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph in an OSPF environment

2016-11-23 Thread Darrell Enns
You may also need to do something with the "public network" and/or "cluster 
network" options in ceph.conf.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Darrell Enns
Sent: Wednesday, November 23, 2016 9:24 AM
To: ceph-us...@ceph.com
Subject: Re: [ceph-users] ceph in an OSPF environment

As far as I am aware, there is no broadcast or multicast traffic involved (at 
least, I don't see any on my cluster). So there should be no issue with routing 
it over layer 3. Have you checked the following:

- name resolution working on all hosts
- firewall/acl rules
- selinux
- tcpdump the mon traffic (port 6789) to see that it's getting through 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] new mon can't join new cluster, probe_timeout / probing

2016-11-23 Thread grin
Hello,

[This is hammer, 0.94.9, since proxmox waits for the new jewel release
due to some relevant fixes.]


This is possibly some network issue, but I cannot see any indicator of
what to look at. mon0 usually stands in quorum alone, and the other mons
cannot join. They get the monmap and intend to join, but it just
never happens; the mons go from synchronising to probing, forever. Raising
the log level doesn't reveal anything to me.

The cluster network and public network differ, and the mons are supposed to
be on the public network.
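
While they sit in this state it is worth capturing each monitor's own view 
through its admin socket, e.g. (mon IDs as in the logs below):

ceph daemon mon.0 mon_status
ceph daemon mon.1 mon_status

mon_status shows the monmap each daemon holds, along with its rank and state.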

mon0:
2016-11-23 16:26:16.920691 7f8f193da700  1 mon.0@0(leader) e1  adding peer 
10.75.13.132:6789/0 to list of hints
2016-11-23 16:26:18.922057 7f8f193da700  1 mon.0@0(leader) e1  adding peer 
10.75.13.132:6789/0 to list of hints
2016-11-23 16:26:20.923695 7f8f193da700  1 mon.0@0(leader) e1  adding peer 
10.75.13.132:6789/0 to list of hints
2016-11-23 16:26:22.925172 7f8f193da700  1 mon.0@0(leader) e1  adding peer 
10.75.13.132:6789/0 to list of hints
...forever


mon1:
2016-11-23 16:25:14.887453 7fe81a87f880  0 ceph version 0.94.9 
(fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 8956
2016-11-23 16:25:14.909873 7fe81a87f880  0 mon.1 does not exist in monmap, will 
attempt to join an existing cluster
2016-11-23 16:25:14.910934 7fe81a87f880  0 using public_addr 10.75.13.132:0/0 
-> 10.75.13.132:6789/0
2016-11-23 16:25:14.911012 7fe81a87f880  0 starting mon.1 rank -1 at 
10.75.13.132:6789/0 mon_data /var/lib/ceph/mon/ceph-1 fsid 
ca404f45-def3-4c22-a83b-7939e3f92514
2016-11-23 16:25:14.911406 7fe81a87f880  1 mon.1@-1(probing) e0 preinit fsid 
ca404f45-def3-4c22-a83b-7939e3f92514
2016-11-23 16:26:14.912255 7fe8137f8700  0 
mon.1@-1(synchronizing).data_health(0) update_stats avail 92% total 69923 MB, 
used 1552 MB, avail 64796 MB
2016-11-23 16:27:14.912613 7fe8137f8700  0 mon.1@-1(probing).data_health(0) 
update_stats avail 92% total 69923 MB, used 1552 MB, avail 64796 MB
2016-11-23 16:28:14.912868 7fe8137f8700  0 mon.1@-1(probing).data_health(0) 
update_stats avail 92% total 69923 MB, used 1552 MB, avail 64796 MB
...forever as well.



Raising the log level on mon0:
2016-11-23 17:19:11.330786 7f8f20c60700 10 _calc_signature seq 366 front_crc_ = 
1411686358 middle_crc = 0 data_crc = 0 sig = 16063324873821844002
2016-11-23 17:19:11.330928 7f8f193da700 20 mon.0@0(leader) e1 have connection
2016-11-23 17:19:11.330937 7f8f193da700 20 mon.0@0(leader) e1 ms_dispatch 
existing session MonSession: mon.? 10.75.13.132:6789/0 is openallow * for mon.? 
10.75.13.132:6789/0
2016-11-23 17:19:11.330947 7f8f193da700 20 mon.0@0(leader) e1  caps allow *
2016-11-23 17:19:11.330953 7f8f193da700 20 is_capable service=mon command= read 
on cap allow *
2016-11-23 17:19:11.330956 7f8f193da700 20  allow so far , doing grant allow *
2016-11-23 17:19:11.330958 7f8f193da700 20  allow all
2016-11-23 17:19:11.330961 7f8f193da700 10 mon.0@0(leader) e1 handle_probe 
mon_probe(probe ca404f45-def3-4c22-a83b-7939e3f92514 name 1 new) v6
2016-11-23 17:19:11.330969 7f8f193da700 10 mon.0@0(leader) e1 
handle_probe_probe mon.? 10.75.13.132:6789/0mon_probe(probe 
ca404f45-def3-4c22-a83b-7939e3f92514 name 1 new) v6 features 55169095435288575
2016-11-23 17:19:11.331009 7f8f193da700  1 mon.0@0(leader) e1  adding peer 
10.75.13.132:6789/0 to list of hints
2016-11-23 17:19:11.331129 7f8f173d6700 10 _calc_signature seq 442670678 
front_crc_ = 1084090475 middle_crc = 0 data_crc = 0 sig = 15627235992780641097
2016-11-23 17:19:11.331164 7f8f173d6700 20 Putting signature in client 
message(seq # 442670678): sig = 15627235992780641097
2016-11-23 17:19:13.344756 7f8f20c60700 10 _calc_signature seq 367 front_crc_ = 
1411686358 middle_crc = 0 data_crc = 0 sig = 10295634500541529978
2016-11-23 17:19:13.344931 7f8f193da700 20 mon.0@0(leader) e1 have connection
2016-11-23 17:19:13.344940 7f8f193da700 20 mon.0@0(leader) e1 ms_dispatch 
existing session MonSession: mon.? 10.75.13.132:6789/0 is openallow * for mon.? 
10.75.13.132:6789/0
2016-11-23 17:19:13.344952 7f8f193da700 20 mon.0@0(leader) e1  caps allow *
2016-11-23 17:19:13.344959 7f8f193da700 20 is_capable service=mon command= read 
on cap allow *
2016-11-23 17:19:13.344962 7f8f193da700 20  allow so far , doing grant allow *
2016-11-23 17:19:13.344964 7f8f193da700 20  allow all
2016-11-23 17:19:13.344967 7f8f193da700 10 mon.0@0(leader) e1 handle_probe 
mon_probe(probe ca404f45-def3-4c22-a83b-7939e3f92514 name 1 new) v6
2016-11-23 17:19:13.344975 7f8f193da700 10 mon.0@0(leader) e1 
handle_probe_probe mon.? 10.75.13.132:6789/0mon_probe(probe 
ca404f45-def3-4c22-a83b-7939e3f92514 name 1 new) v6 features 55169095435288575
2016-11-23 17:19:13.345019 7f8f193da700  1 mon.0@0(leader) e1  adding peer 
10.75.13.132:6789/0 to list of hints


mon1 sometimes says like:
2016-11-23 17:06:04.241491 7f7c3f855700  0 -- 10.75.13.132:6789/0 >> 
10.75.13.131:6789/0 pipe(0x3ae4000 sd=13 :53558 s=2 pgs=106 cs=1 l=0 
c=0x3937600).reader missed message?  skipped from seq 0 to 64927996
2016-11-23 17:06:04.241620 7f7c

[ceph-users] Problems after upgrade to Jewel

2016-11-23 Thread Vincent Godin
Hello,

Our cluster failed again this morning. It took almost the whole day to
stabilize. Here are some of the problems we encountered in the OSDs' logs:

*Some OSDs refused to start :*

-1> 2016-11-23 15:50:49.507588 7f5f5b7a5800 -1 osd.27 196774 load_pgs: have
pgid 9.268 at epoch 196874, but missing map.  Crashing.

0> 2016-11-23 15:50:49.509473 7f5f5b7a5800 -1 osd/OSD.cc: In function 'void
OSD::load_pgs()' thread 7f5f5b7a5800 time 2016-11-23 15:50:49.507597
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")



ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7f5f5c1d35b5]

2: (OSD::load_pgs()+0x1f07) [0x7f5f5bb53b57]

3: (OSD::init()+0x2086) [0x7f5f5bb64e56]

4: (main()+0x2c55) [0x7f5f5bac8be5]

5: (__libc_start_main()+0xf5) [0x7f5f586b7b15]

6: (()+0x353009) [0x7f5f5bb13009]

NOTE: a copy of the executable, or `objdump -rdS ` is needed to
interpret this.



--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds



We finally managed to start them by removing from the OSD, with
ceph-objectstore-tool, each PG whose map was missing, once that PG was in
"active+clean" state on the cluster.


*Some OSDs that start but commit suicide after 3 minutes:*



-5> 2016-11-23 15:32:28.488489 7fbe411ff700  5 osd.24 197525 heartbeat:
osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])

-4> 2016-11-23 15:32:30.188632 7fbe411ff700  5 osd.24 197525 heartbeat:
osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])

-3> 2016-11-23 15:32:32.678977 7fbe67ce3700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5ad4b700' had timed out after 60

-2> 2016-11-23 15:32:32.679010 7fbe67ce3700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5b54c700' had timed out after 60

-1> 2016-11-23 15:32:32.679016 7fbe67ce3700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5b54c700' had suicide timed out after 180

 0> 2016-11-23 15:32:32.680982 7fbe67ce3700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*,
const char*, time_t)' thread 7fbe67ce3700 time 2016-11-23 15:32:32.679038
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")



We have no explanation for this


*Some Dead-Locks* :



OSD.32 refused to start because PG 9.72 has no map:


-1> 2016-11-23 15:02:32.675283 7f2b74492800 -1
osd.32 196921 load_pgs: have pgid 9.72 at epoch 196975, but missing map.
Crashing.

0> 2016-11-23 15:02:32.676710 7f2b74492800 -1 osd/OSD.cc: In function 'void
OSD::load_pgs()' thread 7f2b74492800 time 2016-11-23 15:02:32.675293
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")



PG 9.72 is in state "down+peering" and waiting
for OSD.32 to start or to be set "lost"



We had to declare the OSD lost because of these deadlocks



*Some messages in the logs we'd like an explanation for:*



2016-11-23 15:02:32.202200 7f2b74492800  0 set uid:gid to 167:167
(ceph:ceph)

2016-11-23 15:02:32.202240 7f2b74492800  0 ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1718781

2016-11-23 15:02:32.203557 7f2b74492800  0 pidfile_write: ignore empty
--pid-file

2016-11-23 15:02:32.231376 7f2b74492800  0
filestore(/var/lib/ceph/osd/ceph-32) backend xfs (magic 0x58465342)

2016-11-23 15:02:32.231935 7f2b74492800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option

2016-11-23 15:02:32.231941 7f2b74492800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features:
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option

2016-11-23 15:02:32.231961 7f2b74492800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: splice
is supported

2016-11-23 15:02:32.232777 7f2b74492800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)

2016-11-23 15:02:32.232824 7f2b74492800  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_feature: extsize is
disabled by conf

2016-11-23 15:02:32.233704 7f2b74492800  1 leveldb: Recovering log #102027

2016-11-23 15:02:32.234863 7f2b74492800  1 leveldb: Delete type=3 #102026



2016-11-23 15:02:32.234926 7f2b74492800  1 leveldb: Delete type=0 #102027



2016-11-23 15:02:32.235444 7f2b74492800  0
filestore(/var/lib/ceph/osd/ceph-32) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled

2016-11-23 15:02:32.237484 7f2b74492800  1 journal _open
/var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 1

2016-11-23 15:02:32.238027 7f2b74492800  1 journal _open
/var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 1

2016-11-23 15:02:32.238

Re: [ceph-users] osd set noin ignored for old OSD ids

2016-11-23 Thread Gregory Farnum
On Tue, Nov 22, 2016 at 7:56 PM, Adrian Saul
 wrote:
>
> Hi ,
>  As part of migration between hardware I have been building new OSDs and 
> cleaning up old ones  (osd rm osd.x, osd crush rm osd.x, auth del osd.x).   
> To try and prevent rebalancing kicking in until all the new OSDs are created 
> on a host I use "ceph osd set noin", however what I have seen is that if the 
> new OSD that is created uses a new unique ID, then the flag is honoured and 
> the OSD remains out until I bring it in.  However if the OSD re-uses a 
> previous OSD id then it will go straight to in and start backfilling.  I have 
> to manually out the OSD to stop it (or set nobackfill,norebalance).
>
> Am I doing something wrong in this process or is there something about "noin" 
> that is ignored for previously existing OSDs that have been removed from both 
> the OSD map and crush map?

There are a lot of different pieces of an OSD ID that need to get
deleted for it to be truly gone; my guess is you've missed some of
those. The noin flag doesn't prevent unlinked-but-up CRUSH entries
from getting placed back into the tree, etc.

We may also have a bug though, so if you can demonstrate that the ID
doesn't exist in the CRUSH and OSD dumps then please create a ticket
at tracker.ceph.com!
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread Tomasz Kuzemko
Hi Thomas,

do you have any RBD created as a clone from another snapshot? If yes, then
this would mean you still have some protected snapshots, and the only way to
get rid of them is to flatten the cloned RBD, unprotect the snapshot and delete it.
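
A minimal sketch with placeholder names (pool/child being the cloned RBD and
pool/parent@snap the protected snapshot it was cloned from):

rbd flatten pool/child
rbd snap unprotect pool/parent@snap
rbd snap rm pool/parent@snap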

2016-11-18 12:42 GMT+01:00 Thomas Danan :

> Hi Nick,
>
>
>
> Here are some logs. The system is in IST TZ and I have filtered the logs
> to get only 2 last hours during which we can observe the issue.
>
>
>
> In that particular case, issue is illustrated with the following OSDs
>
>
>
> Primary:
>
> ID:607
>
> PID:2962227
>
> HOST:10.137.81.18
>
>
>
> Secondary1
>
> ID:528
>
> PID:3721728
>
> HOST:10.137.78.194
>
>
>
> Secondary2
>
> ID:771
>
> PID:2806795
>
> HOST:10.137.81.25
>
>
>
> In that specific example, first slow request message is detected at 16:18
>
>
>
> 2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN]
> : 7 slow requests, 7 included below; oldest blocked for > 30.521107 secs
>
> 2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN]
> : slow request 30.521107 seconds old, received at 2016-11-18
> 16:18:21.469965: osd_op(client.2406870.1:140440919 
> rbd_data.616bf2ae8944a.002b85a7
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 1449984~524288] 0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564)
> currently waiting for subops from 528,771
>
>
>
> I see that it is about replicating a 4MB Object with snapc context but in
> my environment I have no snapshot (actually they were all deleted). Also I
> was told those messages were not necessarily related to object replication to
> a snapshot image.
>
> Each time I have a slow request message it is formatted as described with
> 4MB Object and snapc context
>
>
>
> Rados df is showing me that I have 4 cloned objects, I do not understand
> why.
>
>
>
> 15 minutes later seems ops are unblocked after initiating reconnect message
>
>
>
> 2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN]
> : slow request 960.264008 seconds old, received at 2016-11-18
> 16:18:37.856850: osd_op(client.2406634.1:104826541 
> rbd_data.636fe2ae8944a.00111eec
> [set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096]
> 0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for
> subops from 528,771
>
> 2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227
> >> 10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0
> l=0 c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319
> (socket is 10.137.81.135:26749/0)
>
> 2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227
> >> 192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647
> cs=5 l=0 c=0x42798c0).fault, initiating reconnect
>
>
>
> I do not manage to identify anything obvious in the logs.
>
>
>
> Thanks for your help …
>
>
>
> Thomas
>
>
>
>
>
> *From:* Nick Fisk [mailto:n...@fisk.me.uk]
> *Sent:* jeudi 17 novembre 2016 11:02
> *To:* Thomas Danan; n...@fisk.me.uk; 'Peter Maloney'
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* RE: [ceph-users] ceph cluster having blocke requests very
> frequently
>
>
>
> Hi Thomas,
>
>
>
> Do you have the OSD logs from around the time of that slow request (13:12
> to 13:29 period)?
>
>
>
> Do you also see anything about OSD’s going down in the Mon ceph.log file
> around that time?
>
>
>
> 480 seconds is probably far too long for a disk to be busy for, I’m
> wondering if the OSD is either dying and respawning or if you are running
> out of some type of system resource….eg TCP connections or something like
> that, which means the OSD’s can’t communicate with each other.
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *Thomas Danan
> *Sent:* 17 November 2016 08:59
> *To:* n...@fisk.me.uk; 'Peter Maloney'  >
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] ceph cluster having blocke requests very
> frequently
>
>
>
> Hi,
>
>
>
> I have recheck the pattern when slow request are detected.
>
>
>
> I have an example with following (primary: 411, secondary: 176, 594)
>
> On primary slow requests detected: waiting for subops (176, 594)  during
> 16 minutes 
>
>
>
> 2016-11-17 13:29:27.209754 7f001d414700 0 log_channel(cluster) log [WRN] :
> 7 slow requests, 7 included below; oldest blocked for > 480.477315 secs
>
> 2016-11-17 13:29:27.209777 7f001d414700 0 log_channel(cluster) log [WRN] :
> slow request 480.477315 seconds old, received at 2016-11-17
> 13:21:26.732303: osd_op(client.2407558.1:206455044 
> rbd_data.66ea12ae8944a.001acbbc
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 1257472~368640] 0.61fe279f snapc 3fd=[3fd,3de] ondisk+write e210553)
> currently waiting for subops from 176,594
>
>
>
> So the primary OSD is waiting for subops since 13:21 (13:29 - 480 seconds)
>
>
>
> 2016-11-17 13:36:33.039691 7efffd8ee700 0 -- 192.168.228.23:6800/694486
> >> 192.168.228.7:6819/36118

[ceph-users] How are replicas spread in default crush configuration?

2016-11-23 Thread Kevin Olbrich
Hi,

just to make sure, as I did not find a reference in the docs:
Are replicas spread across hosts or "just" OSDs?

I am using a 5 OSD cluster (4 pools, 128 pgs each) with size = 2. Currently
each OSD is a ZFS backed storage array.
Now I installed a server which is planned to host 4x OSDs (and setting size
to 3).

I want to make sure we can resist two offline hosts (in terms of hardware).
Is my assumption correct?

Mit freundlichen Grüßen / best regards,
Kevin Olbrich.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] degraded objects after osd add

2016-11-23 Thread Kevin Olbrich
Hi,

what happens when size = 2 and some objects are in degraded state?
This sounds like easy data loss when the old but active OSD fails while
recovery is in progress?

It would make more sense to have the pg replicate first and then remove the
PG from the old OSD.

Mit freundlichen Grüßen / best regards,
Kevin Olbrich.

>
>  Original Message 
> Subject: Re: [ceph-users] degraded objects after osd add (17-Nov-2016 9:14)
> From:Burkhard Linke 
> To:  c...@dolphin-it.de
>
> Hi,
>
>
> On 11/17/2016 08:07 AM, Steffen Weißgerber wrote:
> > Hello,
> >
> > just for understanding:
> >
> > When starting to fill osd's with data due to setting the weight from 0
> to the normal value
> > the ceph status displays degraded objects (>0.05%).
> >
> > I don't understand the reason for this because there's no storage
> revoked from the cluster,
> > only added. Therefore only the displayed object displacement makes sense.
> If you just added a new OSD, a number of PGs will be backfilling or
> waiting for backfilling (the remapped ones). I/O to these PGs is not
> blocked, and thus object may be modified. AFAIK these objects show up as
> degraded.
>
> I'm not sure how ceph handles these objects, e.g. whether it writes them
> to the old OSDs assigned to the PG, or whether they are put on the new OSD
> already, even if the corresponding PG is waiting for backfilling.
>
> Nonetheless the degraded objects will be cleaned up during backfilling.
>
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Listing out the available namespace in the Ceph Cluster

2016-11-23 Thread David Zafman


Hi Janmejay,

Sorry, I just found your e-mail in my inbox.

There is no command to list namespaces as such, but you can list all objects in
all namespaces using the --all option and filter the results.


I created 10 namespaces (ns1 - ns10) in addition to the default one.

rados -p testpool --all ls --format=json | jq '.[].namespace' | sort -u

""
"ns1"
"ns10"
"ns2"
"ns3"
"ns4"
"ns5"
"ns6"
"ns7"
"ns8"
"ns9"

David


On 11/15/16 12:28 AM, Janmejay Baral wrote:

*Dear Mr. David,*

I have been using Ceph for 1.5 years. We recently upgraded our Ceph cluster
from Hammer to Jewel. Now, for testing purposes, I need to check all the
available namespaces. Can you please help me with the query to find out the
namespaces we have created?





*Thanks & Regards,Janmejay Baral*


*(Actiance India Pvt. Ltd.)Sr. Software Engineer+91-9739741384*



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How are replicas spread in default crush configuration?

2016-11-23 Thread Chris Taylor
 

Kevin, 

After changing the pool size to 3, make sure the min_size is set to 1 to
allow 2 of the 3 hosts to be offline. 

http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values
[2] 
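
For example, assuming a pool named "rbd" (substitute your own pool names):

ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 1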

How many MONs do you have and are they on the same OSD hosts? If you
have 3 MONs running on the OSD hosts and two go offline, you will not
have a quorum of MONs and I/O will be blocked. 

I would also check your CRUSH map. I believe you want to make sure your
rules have "step chooseleaf firstn 0 type host" and not "... type osd"
so that replicas are on different hosts. I have not had to make that
change before so you will want to read up on it first. Don't take my
word for it. 

http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-parameters
[3] 
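
For reference, a typical replicated rule in a decompiled CRUSH map looks
roughly like this (names are illustrative):

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

You can dump and decompile your current map with:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt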

Hope that helps. 

Chris 

On 2016-11-23 1:32 pm, Kevin Olbrich wrote: 

> Hi, 
> 
> just to make sure, as I did not find a reference in the docs: 
> Are replicas spread across hosts or "just" OSDs? 
> 
> I am using a 5 OSD cluster (4 pools, 128 pgs each) with size = 2. Currently 
> each OSD is a ZFS backed storage array. 
> Now I installed a server which is planned to host 4x OSDs (and setting size 
> to 3). 
> 
> I want to make sure we can resist two offline hosts (in terms of hardware). 
> Is my assumption correct? 
> 
> Mit freundlichen Grüßen / best regards,
> Kevin Olbrich. 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]
 

Links:
--
[1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[2]
http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values
[3]
http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-parameters
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How are replicas spread in default crush configuration?

2016-11-23 Thread Samuel Just
On Wed, Nov 23, 2016 at 4:11 PM, Chris Taylor  wrote:
> Kevin,
>
> After changing the pool size to 3, make sure the min_size is set to 1 to
> allow 2 of the 3 hosts to be offline.

If you do this, the flip side is that while in that configuration
losing that single
host will render your data unrecoverable (writes were only witnessed by that
osd).

>
> http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values
>
> How many MONs do you have and are they on the same OSD hosts? If you have 3
> MONs running on the OSD hosts and two go offline, you will not have a quorum
> of MONs and I/O will be blocked.
>
> I would also check your CRUSH map. I believe you want to make sure your
> rules have "step chooseleaf firstn 0 type host" and not "... type osd" so
> that replicas are on different hosts. I have not had to make that change
> before so you will want to read up on it first. Don't take my word for it.
>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-parameters
>
> Hope that helps.
>
>
>
> Chris
>
>
>
> On 2016-11-23 1:32 pm, Kevin Olbrich wrote:
>
> Hi,
>
> just to make sure, as I did not find a reference in the docs:
> Are replicas spread across hosts or "just" OSDs?
>
> I am using a 5 OSD cluster (4 pools, 128 pgs each) with size = 2. Currently
> each OSD is a ZFS backed storage array.
> Now I installed a server which is planned to host 4x OSDs (and setting size
> to 3).
>
> I want to make sure we can resist two offline hosts (in terms of hardware).
> Is my assumption correct?
>
> Mit freundlichen Grüßen / best regards,
> Kevin Olbrich.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] export-diff behavior if an initial snapshot is NOT specified

2016-11-23 Thread Zhongyan Gu
Thank you Jason. My test shows that in the following case, image B will be
exactly the same:
1. clone image A from parent:
#rbd clone 1124-parent@snap1 A

2. create snap for A
#rbd snap create A@snap1

3. create empty image B
#rbd create B -s 1

4. export-diff A then import-diff B:
#rbd export-diff A@snap1 -|./rbd import-diff - B

5. check A@snap1 equals B@snap1
#rbd export A@snap1 -|md5sum
Exporting image: 100% complete...done.
880709d7352b6c9926beb1d829673366  -
#rbd export B@snap1 -|md5sum
Exporting image: 100% complete...done.
880709d7352b6c9926beb1d829673366  -
output shows A@snap1 equals B@snap1

However, in the following case, image B will not be exactly the same:
1. clone image A from parent:
#rbd clone 1124-parent@snap1 A

2. create snap for A
#rbd snap create A@snap1

3. use fio make some change to A

4. create empty image B
#rbd create B -s 1

5. export-diff A then import-diff B:
#rbd export-diff A@snap1 -|./rbd import-diff - B

6. check A@snap1 equals B@snap1
#rbd export A@snap1 -|md5sum
Exporting image: 100% complete...done.
880709d7352b6c9926beb1d829673366  -
#rbd export B@snap1 -|md5sum
Exporting image: 100% complete...done.
bbf7cf69a84f3978c66f5eb082fb91ec  -
output shows A@snap1 DOES NOT equal B@snap1

The second case can always be reproduced. What is wrong with the second
case?

Thanks,
Zhongyan


On Wed, Nov 23, 2016 at 10:11 PM, Jason Dillaman 
wrote:

> What you are seeing sounds like a side-effect of deep-flatten support.
> If you write to an unallocated extent within a cloned image, the
> associated object extent must be read from the parent image, modified,
> and written to the clone image.
>
> Since the Infernalis release, this process has been tweaked if the
> cloned image has a snapshot. In that case, the associated object
> extent is still read from the parent, but instead of being modified
> and written to the HEAD revision, it is left unmodified and is written
> to "pre" snapshot history followed by writing the original
> modification (w/o the parent's object extent data) to the HEAD
> revision.
>
> This change to the IO path was made to support flattening clones and
> dissociating them from their parents even if the clone had snapshots.
>
> Therefore, what you are seeing with export-diff is actually the
> backing object extent of data from the parent image written to the
> clone's "pre" snapshot history. If you had two snapshots and your
> export-diff'ed from the first to second snapshot, you wouldn't see
> this extra data.
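>
> (A hedged illustration of that last point, reusing the image and snapshot
> names from this thread: take a second snapshot and diff between the two,
>
>   rbd snap create A@snap2
>   rbd export-diff --from-snap snap1 A@snap2 - | md5sum
>
> so only changes made between snap1 and snap2 are exported, without the
> parent-backed "pre" snapshot data.)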
>
> To your question about how to prepare image B to make sure it will be
> exactly the same, the answer is that you don't need to do anything. In
> your example above, I am assuming you are manually creating an empty
> Image B and using "import-diff" to populate it. The difference in the
> export-diff is most likely related to fact that the clone lost its
> sparseness on any backing object that was written (e.g. instead of a
> one or more 512 byte diffs within a backing object extent, you will
> see a single, full-object extent with zeroes where the parent image
> had no data).
>
>
> On Wed, Nov 23, 2016 at 5:06 AM, Zhongyan Gu 
> wrote:
> > Let me make the issue more clear.
> > Suppose I cloned image A from a parent image and create snap1 for image A
> > and  then make some change of image A.
> > If I did the rbd export-diff @snap1. how should I prepare the existing
> image
> > B to make sure it  will be exactly same with image A@snap1 after
> import-diff
> > against this image B.
> >
> > Thanks,
> > Zhongyan
> >
> >
> > On Wed, Nov 23, 2016 at 11:34 AM, Zhongyan Gu 
> wrote:
> >>
> >> Thanks Jason, very clear explanation.
> >> However, I found some strange behavior when export-diff on a cloned
> image,
> >> not sure it is a bug on calc_snap_set_diff().
> >> The test is,
> >> Image A is cloned from a parent image. then create snap1 for image A.
> >> The content of export-diff A@snap1 will be changed when update image A.
> >> Only after image A has no overlap with parent, the content of
> export-diff
> >> A@snap1 is stabled, which is almost zero.
> >> I don't think it is a designed behavior. export-diff A@snap1 should
> always
> >> get a stable output no matter image A is cloned or not.
> >>
> >> Please correct me if anything wrong.
> >>
> >> Thanks,
> >> Zhongyan
> >>
> >>
> >>
> >>
> >> On Tue, Nov 22, 2016 at 10:31 PM, Jason Dillaman 
> >> wrote:
> >>>
> >>> On Tue, Nov 22, 2016 at 5:31 AM, Zhongyan Gu 
> >>> wrote:
> >>> > So if initial snapshot is NOT specified, then:
> >>> > rbd export-diff image@snap1 will diff all data to snap1. this cmd
> >>> > equals to
> >>> > :
> >>> > rbd export image@snap1. Is my understand right or not??
> >>>
> >>>
> >>> While they will both export all data associated w/ image@snap1, the
> >>> "export" command will generate a raw, non-sparse dump of the full
> >>> image whereas "export-diff" will export only sections of the image
> >>> that contain data. The file generated from "export" can be used with
> >>> the "import" command to crea

[ceph-users] how to get the default CRUSH map that should be generated by ceph itself ?

2016-11-23 Thread JiaJia Zhong
hi, folks

Is there any way I could get the original (default) crush map after some
manual modifications?

eg:
assuming no one had tuned the crush map, call that map state ORIGINAL, which
was maintained by ceph itself.
If I modified the crush map, call the new map state COMMIT1;
then I added a new osd, and ceph would add it to the crush map; at this time
call the map state COMMIT2.

now, how to get the default CRUSH map that should be generated by ceph
itself? I mean something like ORIGINAL+COMMIT2.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] export-diff behavior if an initial snapshot is NOT specified

2016-11-23 Thread Zhongyan Gu
BTW, I used Hammer 0.94.5 to do the test.

Zhongyan

On Thu, Nov 24, 2016 at 10:07 AM, Zhongyan Gu  wrote:

> Thank you Jason. My test shows in the following case, image B will be
> exactly same:
> 1. clone image A from parent:
> #rbd clone 1124-parent@snap1 A
>
> 2. create snap for A
> #rbd snap create A@snap1
>
> 3. create empty image B
> #rbd create B -s 1
>
> 4. export-diff A then impor-diff B:
> #rbd export-diff A@snap1 -|./rbd import-diff - B
>
> 5. check A@snap1 equals B@snap1
> #rbd export A@snap1 -|md5sum
> Exporting image: 100% complete...done.
> 880709d7352b6c9926beb1d829673366  -
> #rbd export B@snap1 -|md5sum
> Exporting image: 100% complete...done.
> 880709d7352b6c9926beb1d829673366  -
> output shows A@snap1 equals B@snap1
>
> However, in the following case, image B will not be exactly same:
> 1. clone image A from parent:
> #rbd clone 1124-parent@snap1 A
>
> 2. create snap for A
> #rbd snap create A@snap1
>
> 3. use fio make some change to A
>
> 4. create empty image B
> #rbd create B -s 1
>
> 4. export-diff A then impor-diff B:
> #rbd export-diff A@snap1 -|./rbd import-diff - B
>
> 5. check A@snap1 equals B@snap1
> #rbd export A@snap1 -|md5sum
> Exporting image: 100% complete...done.
> 880709d7352b6c9926beb1d829673366  -
> #rbd export B@snap1 -|md5sum
> Exporting image: 100% complete...done.
> bbf7cf69a84f3978c66f5eb082fb91ec  -
> output shows A@snap1 DOES NOT equal B@snap1
>
> The second case can always be reproduced. What is wrong with the second
> case?
>
> Thanks,
> Zhongyan
>
>
> On Wed, Nov 23, 2016 at 10:11 PM, Jason Dillaman 
> wrote:
>
>> What you are seeing sounds like a side-effect of deep-flatten support.
>> If you write to an unallocated extent within a cloned image, the
>> associated object extent must be read from the parent image, modified,
>> and written to the clone image.
>>
>> Since the Infernalis release, this process has been tweaked if the
>> cloned image has a snapshot. In that case, the associated object
>> extent is still read from the parent, but instead of being modified
>> and written to the HEAD revision, it is left unmodified and is written
>> to "pre" snapshot history followed by writing the original
>> modification (w/o the parent's object extent data) to the HEAD
>> revision.
>>
>> This change to the IO path was made to support flattening clones and
>> dissociating them from their parents even if the clone had snapshots.
>>
>> Therefore, what you are seeing with export-diff is actually the
>> backing object extent of data from the parent image written to the
>> clone's "pre" snapshot history. If you had two snapshots and your
>> export-diff'ed from the first to second snapshot, you wouldn't see
>> this extra data.
>>
>> To your question about how to prepare image B to make sure it will be
>> exactly the same, the answer is that you don't need to do anything. In
>> your example above, I am assuming you are manually creating an empty
>> Image B and using "import-diff" to populate it. The difference in the
>> export-diff is most likely related to fact that the clone lost its
>> sparseness on any backing object that was written (e.g. instead of a
>> one or more 512 byte diffs within a backing object extent, you will
>> see a single, full-object extent with zeroes where the parent image
>> had no data).
>>
>>
>> On Wed, Nov 23, 2016 at 5:06 AM, Zhongyan Gu 
>> wrote:
>> > Let me make the issue more clear.
>> > Suppose I cloned image A from a parent image and create snap1 for image
>> A
>> > and  then make some change of image A.
>> > If I did the rbd export-diff @snap1. how should I prepare the existing
>> image
>> > B to make sure it  will be exactly same with image A@snap1 after
>> import-diff
>> > against this image B.
>> >
>> > Thanks,
>> > Zhongyan
>> >
>> >
>> > On Wed, Nov 23, 2016 at 11:34 AM, Zhongyan Gu 
>> wrote:
>> >>
>> >> Thanks Jason, very clear explanation.
>> >> However, I found some strange behavior when export-diff on a cloned
>> image,
>> >> not sure it is a bug on calc_snap_set_diff().
>> >> The test is,
>> >> Image A is cloned from a parent image. then create snap1 for image A.
>> >> The content of export-diff A@snap1 will be changed when update image
>> A.
>> >> Only after image A has no overlap with parent, the content of
>> export-diff
>> >> A@snap1 is stabled, which is almost zero.
>> >> I don't think it is a designed behavior. export-diff A@snap1 should
>> always
>> >> get a stable output no matter image A is cloned or not.
>> >>
>> >> Please correct me if anything wrong.
>> >>
>> >> Thanks,
>> >> Zhongyan
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Nov 22, 2016 at 10:31 PM, Jason Dillaman 
>> >> wrote:
>> >>>
>> >>> On Tue, Nov 22, 2016 at 5:31 AM, Zhongyan Gu 
>> >>> wrote:
>> >>> > So if initial snapshot is NOT specified, then:
>> >>> > rbd export-diff image@snap1 will diff all data to snap1. this cmd
>> >>> > equals to
>> >>> > :
>> >>> > rbd export image@snap1. Is my understand right or not??
>> >>>
>> >>>
>> >>> While they will both ex

Re: [ceph-users] [EXTERNAL] Re: osd set noin ignored for old OSD ids

2016-11-23 Thread Will . Boege
From my experience noin doesn't stop new OSDs from being marked in. noin only
works on OSDs already in the crushmap. To accomplish the behavior you want
I've injected "mon osd auto mark new in = false" into MONs. This also seems to
set their OSD weight to 0 when they are created.
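
A sketch of how that can be applied (the option name is as above; depending on
the version a mon restart may be needed for it to stick):

# in ceph.conf on the monitors
[mon]
mon osd auto mark new in = false

# or injected at runtime
ceph tell mon.* injectargs '--mon-osd-auto-mark-new-in=false'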

> On Nov 23, 2016, at 1:47 PM, Gregory Farnum  wrote:
> 
> On Tue, Nov 22, 2016 at 7:56 PM, Adrian Saul
>  wrote:
>> 
>> Hi ,
>> As part of migration between hardware I have been building new OSDs and 
>> cleaning up old ones  (osd rm osd.x, osd crush rm osd.x, auth del osd.x).   
>> To try and prevent rebalancing kicking in until all the new OSDs are created 
>> on a host I use "ceph osd set noin", however what I have seen is that if the 
>> new OSD that is created uses a new unique ID, then the flag is honoured and 
>> the OSD remains out until I bring it in.  However if the OSD re-uses a 
>> previous OSD id then it will go straight to in and start backfilling.  I 
>> have to manually out the OSD to stop it (or set nobackfill,norebalance).
>> 
>> Am I doing something wrong in this process or is there something about 
>> "noin" that is ignored for previously existing OSDs that have been removed 
>> from both the OSD map and crush map?
> 
> There are a lot of different pieces of an OSD ID that need to get
> deleted for it to be truly gone; my guess is you've missed some of
> those. The noin flag doesn't prevent unlinked-but-up CRUSH entries
> from getting placed back into the tree, etc.
> 
> We may also have a bug though, so if you can demonstrate that the ID
> doesn't exist in the CRUSH and OSD dumps then please create a ticket
> at tracker.ceph.com!
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [EXTERNAL] Re: osd set noin ignored for old OSD ids

2016-11-23 Thread Adrian Saul

Thanks - that is more in line with what I was looking for: being able to
suppress backfills/rebalancing until a host's full set of OSDs is up and
ready.
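
Roughly the sequence I am after (using the standard nobackfill/norebalance
cluster flags):

ceph osd set norebalance
ceph osd set nobackfill
# ... create / bring up the host's full set of new OSDs ...
ceph osd unset nobackfill
ceph osd unset norebalance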


> -Original Message-
> From: Will.Boege [mailto:will.bo...@target.com]
> Sent: Thursday, 24 November 2016 2:17 PM
> To: Gregory Farnum
> Cc: Adrian Saul; ceph-users@lists.ceph.com
> Subject: Re: [EXTERNAL] Re: [ceph-users] osd set noin ignored for old OSD
> ids
>
> From my experience noin doesn't stop new OSDs from being marked in. noin
> only works on OSDs already in the crushmap. To accomplish the behavior you
> want I've injected "mon osd auto mark new in = false" into MONs. This also
> seems to set their OSD weight to 0 when they are created.
>
> > On Nov 23, 2016, at 1:47 PM, Gregory Farnum 
> wrote:
> >
> > On Tue, Nov 22, 2016 at 7:56 PM, Adrian Saul
> >  wrote:
> >>
> >> Hi ,
> >> As part of migration between hardware I have been building new OSDs
> and cleaning up old ones  (osd rm osd.x, osd crush rm osd.x, auth del osd.x).
> To try and prevent rebalancing kicking in until all the new OSDs are created
> on a host I use "ceph osd set noin", however what I have seen is that if the
> new OSD that is created uses a new unique ID, then the flag is honoured and
> the OSD remains out until I bring it in.  However if the OSD re-uses a 
> previous
> OSD id then it will go straight to in and start backfilling.  I have to 
> manually out
> the OSD to stop it (or set nobackfill,norebalance).
> >>
> >> Am I doing something wrong in this process or is there something about
> "noin" that is ignored for previously existing OSDs that have been removed
> from both the OSD map and crush map?
> >
> > There are a lot of different pieces of an OSD ID that need to get
> > deleted for it to be truly gone; my guess is you've missed some of
> > those. The noin flag doesn't prevent unlinked-but-up CRUSH entries
> > from getting placed back into the tree, etc.
> >
> > We may also have a bug though, so if you can demonstrate that the ID
> > doesn't exist in the CRUSH and OSD dumps then please create a ticket
> > at tracker.ceph.com!
> > -Greg
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread Thomas Danan
Hi Tomasz,

For some reason, I am still seeing a snapshot context even though no snapshots
are left in my system, probably because not all objects have been deleted (I
still see 4 object clones).

However, I was told that, from the logs I was showing, we are not trying to
replicate the whole 4MB object but only part of it, so this is probably not
related to snapshots (cf. write_size 4194304, write 1449984~524288):
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771
Thomas
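
(For clarity, the op's "write 1449984~524288" decodes as offset~length within
the object: a 524288-byte (512 KiB) write at byte offset 1449984 of the
4194304-byte (4 MiB) rbd_data object, i.e. a partial-object write rather than
a replication of the full 4 MiB object.)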

From: Tomasz Kuzemko [mailto:tom...@kuzemko.net]
Sent: jeudi 24 novembre 2016 01:42
To: Thomas Danan
Cc: n...@fisk.me.uk; Peter Maloney; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,
do you have any RBD created as a clone from another snapshot? If yes, then this
would mean you still have some protected snapshots, and the only way to get rid
of them is to flatten the cloned RBD, unprotect the snapshot and delete it.

2016-11-18 12:42 GMT+01:00 Thomas Danan 
mailto:thomas.da...@mycom-osi.com>>:
Hi Nick,

Here are some logs. The system is in IST TZ and I have filtered the logs to get 
only 2 last hours during which we can observe the issue.

In that particular case, issue is illustrated with the following OSDs

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, first slow request message is detected at 16:18

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB object with a snapc context, but in my
environment I have no snapshots (actually they were all deleted). Also I was
told those messages were not necessarily related to object replication to a
snapshot image.
Each time I have a slow request message it is formatted as described, with a
4MB object and a snapc context.

Rados df is showing me that I have 4 cloned objects, I do not understand why.

15 minutes later it seems the ops are unblocked, after an "initiating reconnect" message

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 
10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 
sd=35 :6840 s=0 pgs=0 cs=0 l=0 c=0x21405020).accept peer addr is really 
10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 
192.168.228.36:6841/2962227 >> 
192.168.228.28:6805/3721728 pipe(0x1271b000 
sd=34 :50711 s=2 pgs=647 cs=5 l=0 c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs.

Thanks for your help …

Thomas


From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: jeudi 17 novembre 2016 11:02
To: Thomas Danan; n...@fisk.me.uk; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

Do you also see anything about OSD’s going down in the Mon ceph.log file around 
that time?

480 seconds is probably far too long for a disk to be busy for, I’m wondering 
if the OSD is either dying and respawning or if you are running out of some 
type of system resource….eg TCP connections or something like that, which means 
the OSD’s can’t communicate with each other.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 17 November 2016 08:59
To: n...@fisk.me.uk

Re: [ceph-users] [EXTERNAL] Re: ceph in an OSPF environment

2016-11-23 Thread Will . Boege
Check your MTU. I think OSPF has issues when fragmenting. Try setting your 
interface MTU to something obnoxiously small to ensure that anything upstream 
isn't fragmenting - say 1200. If it works, try a saner value like 1496, which 
accounts for any VLAN headers. 
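
A quick way to check for a path MTU problem (hypothetical host name; the sizes
leave room for 28 bytes of ICMP/IP headers) is to ping with fragmentation
disallowed:

ping -M do -s 1472 <mon-host>    # should pass on a clean 1500-byte MTU path
ping -M do -s 8972 <mon-host>    # should pass end to end on a 9000-byte MTU path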

If you're running in a spine/leaf you might just want to consider segregating 
Ceph replication traffic by interface and not network.  

I'd also be interested in seeing any reference arch around Ceph in spine leaf 
that anyone has implemented. 

> On Nov 23, 2016, at 11:29 AM, Darrell Enns  wrote:
> 
> You may also need to do something with the "public network" and/or "cluster 
> network" options in ceph.conf.
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Darrell Enns
> Sent: Wednesday, November 23, 2016 9:24 AM
> To: ceph-us...@ceph.com
> Subject: Re: [ceph-users] ceph in an OSPF environment
> 
> As far as I am aware, there is no broadcast or multicast traffic involved (at 
> least, I don't see any on my cluster). So there should be no issue with 
> routing it over layer 3. Have you checked the following:
> 
> - name resolution working on all hosts
> - firewall/acl rules
> - selinux
> - tcpdump the mon traffic (port 6789) to see that it's getting through 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph in an OSPF environment

2016-11-23 Thread Ansgar Jazdzewski
Hi,

we are planning to build a new datacenter with a layer-3 routed network
under all our servers.
So each server will have a 172.16.x.y/32 IP that is announced and shared
using OSPF with ECMP.

Now I am trying to install Ceph on these nodes, and I got stuck because the
OSD nodes cannot reach the MON (ceph -s gives no output).

Ping is OK and telnet to the mon port is also working.


default  proto bird  src 172.16.162.10
   nexthop via 192.168.1.1  dev eno1 weight 1
   nexthop via 192.168.2.1  dev eno2 weight 1
172.16.162.1  proto bird  src 172.16.162.10
   nexthop via 192.168.1.1  dev eno1 weight 1
   nexthop via 192.168.2.1  dev eno2 weight 1
172.16.162.11  proto bird  src 172.16.162.10
   nexthop via 192.168.1.11  dev eno1 weight 1
   nexthop via 192.168.2.11  dev eno2 weight 1
172.16.162.12  proto bird  src 172.16.162.10
   nexthop via 192.168.1.12  dev eno1 weight 1
   nexthop via 192.168.2.12  dev eno2 weight 1
172.16.162.13  proto bird  src 172.16.162.10
   nexthop via 192.168.1.13  dev eno1 weight 1
   nexthop via 192.168.2.13  dev eno2 weight 1



root@node-0001a:~/ceph/eu-fra# systemctl restart ceph-mon@node-0001a
root@node-0001a:~/ceph/eu-fra# netstat -ntpl | grep ceph-mon
tcp0  0 172.16.162.10:6789  0.0.0.0:*
LISTEN  6841/ceph-mon
#

So I am not sure what the root cause of this behavior is!

any ideas?

thanks,
Ansgar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread Thomas Danan
Hi David,

Actually this PG subfolder splitting has not been explored yet; I will have a 
look. In our setup OSDs are never marked as down, probably because of the 
following settings:

mon_osd_adjust_heartbeat_grace = false
mon_osd_adjust_down_out_interval = false
mon_osd_min_down_reporters = 5
mon_osd_min_down_reports = 10

Thomas

From: David Turner [mailto:david.tur...@storagecraft.com]
Sent: mercredi 23 novembre 2016 21:27
To: n...@fisk.me.uk; Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

This thread has gotten quite large and I haven't read most of it, so I 
apologize if this is a duplicate idea/suggestion.  100% of the time that our 
cluster has blocked requests and we aren't increasing pg_num, adding storage, 
or having a disk fail... it is PG subfolder splitting.  100% of the time, every 
time, this is our cause of blocked requests.  It is often accompanied by drives 
being marked down by the cluster even though the osd daemon is still running.

The settings that govern this are filestore merge threshold and filestore split 
multiple 
(http://docs.ceph.com/docs/giant/rados/configuration/filestore-config-ref/).
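
As a rough example of what those look like in ceph.conf (the numbers here are
only illustrative; the right values depend on your object counts per PG):

[osd]
filestore merge threshold = 40
filestore split multiple = 8

# a PG subfolder splits when it holds more than roughly
# 16 * filestore_split_multiple * abs(filestore_merge_threshold) objects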


David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943





From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Nick Fisk 
[n...@fisk.me.uk]
Sent: Wednesday, November 23, 2016 6:09 AM
To: 'Thomas Danan'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently
Hi Thomas,

I’m afraid I can’t offer any more advice; there isn’t anything that I can see 
which could be the trigger. I know we spoke about downgrading the kernel, did 
you manage to try that?

Nick

From: Thomas Danan [mailto:thomas.da...@mycom-osi.com]
Sent: 23 November 2016 11:29
To: n...@fisk.me.uk; 'Peter Maloney' 
mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi all,

Still not able to find any explanation for this issue.

I recently tested the network and I am seeing some retransmits in the iperf 
output, but overall the bandwidth during a 10-second test is around 7 to 8 
Gbps.
I was not sure whether it was the test itself that was overloading the network 
or whether my network switches were having an issue.
The switches have been checked and they show no congestion issues or other 
errors.

I really don’t know what to check or test next; any idea is more than welcome …

Thomas

From: Thomas Danan
Sent: vendredi 18 novembre 2016 17:12
To: 'n...@fisk.me.uk'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Here are some logs. The system is in IST TZ and I have filtered the logs to get 
only 2 last hours during which we can observe the issue.

In that particular case, issue is illustrated with the following OSDs

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, first slow request message is detected at 16:18

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB object with a snapc context, but in my 
environment I have no snapshots (actually they were all deleted). Also I was 
told those messages were not necessarily related to object replication to a 
snapshot image.
Each time I have a slow request message it is formatted as described, with a 
4MB object and a snapc context.

Rados df is showing me that I have 4 cloned objects, I do not understand why.

15 minutes later it seems the ops are unblocked, after an "initiating reconnect" message

2016-11