Re: [ceph-users] Intel power tuning - 30% throughput performance increase

2017-05-04 Thread Blair Bethwaite
Sounds good, but we could also have a config option to set it before dropping
root?
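
For reference, checking and pinning the pm_qos value from a shell looks
roughly like this (a sketch of what tuned does under the hood; the value is a
32-bit integer, and the request only lasts while the file descriptor stays
open):

  # read the currently requested latency in microseconds
  od -An -td4 /dev/cpu_dma_latency

  # request 0us: write a 32-bit zero and keep fd 3 open; closing it
  # (or exiting the shell) drops the request again
  exec 3> /dev/cpu_dma_latency
  printf '\x00\x00\x00\x00' >&3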

On 4 May 2017 20:28, "Brad Hubbard"  wrote:

On Thu, May 4, 2017 at 10:58 AM, Haomai Wang  wrote:
> refer to https://github.com/ceph/ceph/pull/5013

How about we issue a warning about possible performance implications
if we detect this is not set to 1 *or* 0 at startup?

>
> On Thu, May 4, 2017 at 7:56 AM, Brad Hubbard  wrote:
>> +ceph-devel to get input on whether we want/need to check the value of
>> /dev/cpu_dma_latency (platform dependant) at startup and issue a
>> warning, or whether documenting this would suffice?
>>
>> Any doc contribution would be welcomed.
>>
>> On Wed, May 3, 2017 at 7:18 PM, Blair Bethwaite
>>  wrote:
>>> On 3 May 2017 at 19:07, Dan van der Ster  wrote:
 Whether cpu_dma_latency should be 0 or 1, I'm not sure yet. I assume
 your 30% boost was when going from throughput-performance to
 dma_latency=0, right? I'm trying to understand what is the incremental
 improvement from 1 to 0.
>>>
>>> Probably minimal given that represents a state transition latency
>>> taking only 1us. Presumably the main issue is when the CPU can drop
>>> into the lower states and the compounding impact of that over time. I
>>> will do some simple characterisation of that over the next couple of
>>> weeks and report back...
>>>
>>> --
>>> Cheers,
>>> ~Blairo
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Cheers,
>> Brad
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Cheers,
Brad
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread hrchu
Yes, but I have not yet found an open-source reverse proxy that can achieve it.
haproxy blocks requests instead of limiting bandwidth to a fixed Mbps; nginx can
only limit the download speed (via the proxy_limit_rate option), and it has the
negative side effect of buffering the response body, which causes a huge
performance hit.

On Fri, May 5, 2017 at 6:04 AM, Robin H. Johnson  wrote:

> On Thu, May 04, 2017 at 04:35:21PM +0800, hrchu wrote:
> > Thanks for reply.
> >
> > tc can only do limit on interfaces or given IPs, but what I am talking
> > about is "per connection", e.g.,  each put object could be 5MB/s, get
> > object could be 1MB/s.
> To achieve your required level of control, you need haproxy, or other
> HTTP-aware reverse proxy, as to have a different limit based on the
> operation (and possibly the access key).
>
> --
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
> E-Mail   : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread hrchu
According to the link you provided, haproxy seems to be able only to block
requests rather than limit bandwidth.


If people violate the rate limiting in this example they are redirected to
the backend ease-up-y0 which gives them 503 error page that can be
customized.



On Fri, May 5, 2017 at 7:38 AM, George Mihaiescu 
wrote:

> Terminate the connections on haproxy which  is great for ssl as well, and
> use these instructions to set qos per connection and data transferred:
> http://blog.serverfault.com/2010/08/26/1016491873/
>
>
>
>
> On May 4, 2017, at 04:35, hrchu  wrote:
>
> Thanks for reply.
>
> tc can only do limit on interfaces or given IPs, but what I am talking
> about is "per connection", e.g.,  each put object could be 5MB/s, get
> object could be 1MB/s.
>
> Correct me if anything wrong.
>
>
> Regards,
>
> Chu, Hua-Rong (曲華榮), +886-3-4227151 #57968
> Networklab, Computer Science & Information Engineering,
> National Central University, Jhongli, Taiwan R.O.C.
>
> On Thu, May 4, 2017 at 4:01 PM, Marc Roos 
> wrote:
>
>>
>>
>>
>> No experience with it. But why not use linux for it? Maybe this solution
>> on every RGW is sufficient, I cannot imagine you need 3rd party for
>> this.
>>
>> https://unix.stackexchange.com/questions/28198/how-to-limit-
>> network-bandwidth
>> https://wiki.archlinux.org/index.php/Advanced_traffic_control
>>
>>
>>
>> -Original Message-
>> From: hrchu [mailto:petertc@gmail.com]
>> Sent: donderdag 4 mei 2017 9:24
>> To: Ceph Users
>> Subject: [ceph-users] Limit bandwidth on RadosGW?
>>
>> Hi all,
>> I want to limit RadosGW per connection upload/download speed for QoS.
>> There is no built-in option for this, so maybe a 3rd party reverse proxy
>> in front of Radosgw is needed. Does anyone have experience about this?
>>
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel power tuning - 30% throughput performance increase

2017-05-04 Thread Brad Hubbard
On Thu, May 4, 2017 at 10:58 AM, Haomai Wang  wrote:
> refer to https://github.com/ceph/ceph/pull/5013

How about we issue a warning about possible performance implications
if we detect this is not set to 1 *or* 0 at startup?

>
> On Thu, May 4, 2017 at 7:56 AM, Brad Hubbard  wrote:
>> +ceph-devel to get input on whether we want/need to check the value of
>> /dev/cpu_dma_latency (platform dependant) at startup and issue a
>> warning, or whether documenting this would suffice?
>>
>> Any doc contribution would be welcomed.
>>
>> On Wed, May 3, 2017 at 7:18 PM, Blair Bethwaite
>>  wrote:
>>> On 3 May 2017 at 19:07, Dan van der Ster  wrote:
 Whether cpu_dma_latency should be 0 or 1, I'm not sure yet. I assume
 your 30% boost was when going from throughput-performance to
 dma_latency=0, right? I'm trying to understand what is the incremental
 improvement from 1 to 0.
>>>
>>> Probably minimal given that represents a state transition latency
>>> taking only 1us. Presumably the main issue is when the CPU can drop
>>> into the lower states and the compounding impact of that over time. I
>>> will do some simple characterisation of that over the next couple of
>>> weeks and report back...
>>>
>>> --
>>> Cheers,
>>> ~Blairo
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Cheers,
>> Brad
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread George Mihaiescu
Terminate the connections on haproxy, which is great for SSL as well, and use
these instructions to set QoS per connection and data transferred:
http://blog.serverfault.com/2010/08/26/1016491873/


 

> On May 4, 2017, at 04:35, hrchu  wrote:
> 
> Thanks for reply.
> 
> tc can only do limit on interfaces or given IPs, but what I am talking about 
> is "per connection", e.g.,  each put object could be 5MB/s, get object could 
> be 1MB/s.
> 
> Correct me if anything wrong.
> 
> 
> Regards,
> 
> Chu, Hua-Rong (曲華榮), +886-3-4227151 #57968
> Networklab, Computer Science & Information Engineering,
> National Central University, Jhongli, Taiwan R.O.C.
> 
>> On Thu, May 4, 2017 at 4:01 PM, Marc Roos  wrote:
>> 
>> 
>> 
>> No experience with it. But why not use linux for it? Maybe this solution
>> on every RGW is sufficient, I cannot imagine you need 3rd party for
>> this.
>> 
>> https://unix.stackexchange.com/questions/28198/how-to-limit-network-bandwidth
>> https://wiki.archlinux.org/index.php/Advanced_traffic_control
>> 
>> 
>> 
>> -Original Message-
>> From: hrchu [mailto:petertc@gmail.com]
>> Sent: donderdag 4 mei 2017 9:24
>> To: Ceph Users
>> Subject: [ceph-users] Limit bandwidth on RadosGW?
>> 
>> Hi all,
>> I want to limit RadosGW per connection upload/download speed for QoS.
>> There is no built-in option for this, so maybe a 3rd party reverse proxy
>> in front of Radosgw is needed. Does anyone have experience about this?
>> 
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread Robin H. Johnson
On Thu, May 04, 2017 at 04:35:21PM +0800, hrchu wrote:
> Thanks for reply.
> 
> tc can only do limit on interfaces or given IPs, but what I am talking
> about is "per connection", e.g.,  each put object could be 5MB/s, get
> object could be 1MB/s.
To achieve your required level of control, you need haproxy or another
HTTP-aware reverse proxy, so as to apply a different limit based on the
operation (and possibly the access key).

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

2017-05-04 Thread Sage Weil
Hi Aaron-

Sorry, lost track of this one.  In order to get backtraces out of the core 
you need the matching executables.  Can you make sure the ceph-osd-dbg or 
ceph-debuginfo package is installed on the machine (depending on whether it's 
a deb or rpm system), and then run 'gdb ceph-osd corefile' and 'thr app all bt'?
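
For example, something along these lines (the core path is just illustrative;
if systemd-coredump stored the core compressed, unxz it or extract it with
coredumpctl first):

  gdb -batch -ex 'thread apply all bt' /usr/bin/ceph-osd <corefile> > osd-backtraces.txt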

Thanks!
sage


On Thu, 4 May 2017, Aaron Ten Clay wrote:

> Were the backtraces we obtained not useful? Is there anything else we
> can try to get the OSDs up again?
> 
> On Wed, Apr 19, 2017 at 4:18 PM, Aaron Ten Clay  wrote:
> > I'm new to doing this all via systemd and systemd-coredump, but I appear to
> > have gotten cores from two OSD processes. When xzipped they are < 2MIB each,
> > but I threw them on my webserver to avoid polluting the mailing list. This
> > seems oddly small, so if I've botched the process somehow let me know :)
> >
> > https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493.xz
> > https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508.xz
> >
> > And for reference:
> > root@osd001:/var/lib/systemd/coredump# ceph -v
> > ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
> >
> >
> > I am also investigating sysdig as recommended.
> >
> > Thanks!
> > -Aaron
> >
> >
> > On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil  wrote:
> >>
> >> On Sat, 15 Apr 2017, Aaron Ten Clay wrote:
> >> > Hi all,
> >> >
> >> > Our cluster is experiencing a very odd issue and I'm hoping for some
> >> > guidance on troubleshooting steps and/or suggestions to mitigate the
> >> > issue.
> >> > tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and
> >> > are
> >> > eventually nuked by oom_killer.
> >>
> >> My guess is that there is a bug in a decoding path and it's
> >> trying to allocate some huge amount of memory.  Can you try setting a
> >> memory ulimit to something like 40gb and then enabling core dumps so you
> >> can get a core?  Something like
> >>
> >> ulimit -c unlimited
> >> ulimit -m 2000
> >>
> >> or whatever the corresponding systemd unit file options are...
> >>
> >> Once we have a core file it will hopefully be clear who is
> >> doing the bad allocation...
> >>
> >> sage
> >>
> >>
> >>
> >> >
> >> > I'll try to explain the situation in detail:
> >> >
> >> > We have 24-4TB bluestore HDD OSDs, and 4-600GB SSD OSDs. The SSD OSDs
> >> > are in
> >> > a different CRUSH "root", used as a cache tier for the main storage
> >> > pools,
> >> > which are erasure coded and used for cephfs. The OSDs are spread across
> >> > two
> >> > identical machines with 128GiB of RAM each, and there are three monitor
> >> > nodes on different hardware.
> >> >
> >> > Several times we've encountered crippling bugs with previous Ceph
> >> > releases
> >> > when we were on RC or betas, or using non-recommended configurations, so
> >> > in
> >> > January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04,
> >> > and
> >> > went with stable Kraken 11.2.0 with the configuration mentioned above.
> >> > Everything was fine until the end of March, when one day we find all but
> >> > a
> >> > couple of OSDs are "down" inexplicably. Investigation reveals oom_killer
> >> > came along and nuked almost all the ceph-osd processes.
> >> >
> >> > We've gone through a bunch of iterations of restarting the OSDs, trying
> >> > to
> >> > bring them up one at a time gradually, all at once, various
> >> > configuration
> >> > settings to reduce cache size as suggested in this ticket:
> >> > http://tracker.ceph.com/issues/18924...
> >> >
> >> > I don't know if that ticket really pertains to our situation or not, I
> >> > have
> >> > no experience with memory allocation debugging. I'd be willing to try if
> >> > someone can point me to a guide or walk me through the process.
> >> >
> >> > I've even tried, just to see if the situation was  transitory, adding
> >> > over
> >> > 300GiB of swap to both OSD machines. The OSD procs managed to allocate,
> >> > in a
> >> > matter of 5-10 minutes, more than 300GiB of RAM pressure and became
> >> > oom_killer victims once again.
> >> >
> >> > No software or hardware changes took place around the time this problem
> >> > started, and no significant data changes occurred either. We added about
> >> > 40GiB of ~1GiB files a week or so before the problem started and that's
> >> > the
> >> > last time data was written.
> >> >
> >> > I can only assume we've found another crippling bug of some kind, this
> >> > level
> >> > of memory usage is entirely unprecedented. What can we do?
> >> >
> >> > Thanks in advance for any suggestions.
> >> > -Aaron
> >> >
> >> >
> >
> >
> >
> >
> > --
> > Aaron Ten Clay
> > https://aarontc.com
> 
> 
> 
> -- 
> Aaron Ten Clay
> https://aarontc.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com

[ceph-users] RS vs LRC - abnormal results

2017-05-04 Thread Oleg Kolosov
Hi,
I'm comparing different configurations of LRC with Reed-Solomon.
Specifically I'm comparing the total data read in all OSDs during a
reconstruction of a single node (I drop a single OSD and measure until the
system is stable again).
While most of the configurations produced the expected result, a certain
configuration stood out. The expectation for the following configuration was
that RS would generate 40% more reads; however, LRC generated 8% more.

RS configuration: k=4, m=6
LRC configuration:

sudo ceph osd erasure-code-profile set myprofile \
plugin=lrc \
mapping=DD_DD_ \
layers='[
[ "DD_DD_", "" ],
[ "DDc___", "" ],
[ "___DDc", "" ],
]' \
ruleset-steps='[
[ "chooseleaf", "osd",  10  ],
]'

RS will always require 4 nodes for reconstruction, whereas LRC would use
fewer (4 in the worst case and 2 in the best).

Any thoughts on this matter?

Thanks,
Oleg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

2017-05-04 Thread Aaron Ten Clay
Were the backtraces we obtained not useful? Is there anything else we
can try to get the OSDs up again?

On Wed, Apr 19, 2017 at 4:18 PM, Aaron Ten Clay  wrote:
> I'm new to doing this all via systemd and systemd-coredump, but I appear to
> have gotten cores from two OSD processes. When xzipped they are < 2MIB each,
> but I threw them on my webserver to avoid polluting the mailing list. This
> seems oddly small, so if I've botched the process somehow let me know :)
>
> https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493.xz
> https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508.xz
>
> And for reference:
> root@osd001:/var/lib/systemd/coredump# ceph -v
> ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
>
>
> I am also investigating sysdig as recommended.
>
> Thanks!
> -Aaron
>
>
> On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil  wrote:
>>
>> On Sat, 15 Apr 2017, Aaron Ten Clay wrote:
>> > Hi all,
>> >
>> > Our cluster is experiencing a very odd issue and I'm hoping for some
>> > guidance on troubleshooting steps and/or suggestions to mitigate the
>> > issue.
>> > tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and
>> > are
>> > eventually nuked by oom_killer.
>>
>> My guess is that there is a bug in a decoding path and it's
>> trying to allocate some huge amount of memory.  Can you try setting a
>> memory ulimit to something like 40gb and then enabling core dumps so you
>> can get a core?  Something like
>>
>> ulimit -c unlimited
>> ulimit -m 2000
>>
>> or whatever the corresponding systemd unit file options are...
>>
>> Once we have a core file it will hopefully be clear who is
>> doing the bad allocation...
>>
>> sage
>>
>>
>>
>> >
>> > I'll try to explain the situation in detail:
>> >
>> > We have 24-4TB bluestore HDD OSDs, and 4-600GB SSD OSDs. The SSD OSDs
>> > are in
>> > a different CRUSH "root", used as a cache tier for the main storage
>> > pools,
>> > which are erasure coded and used for cephfs. The OSDs are spread across
>> > two
>> > identical machines with 128GiB of RAM each, and there are three monitor
>> > nodes on different hardware.
>> >
>> > Several times we've encountered crippling bugs with previous Ceph
>> > releases
>> > when we were on RC or betas, or using non-recommended configurations, so
>> > in
>> > January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04,
>> > and
>> > went with stable Kraken 11.2.0 with the configuration mentioned above.
>> > Everything was fine until the end of March, when one day we find all but
>> > a
>> > couple of OSDs are "down" inexplicably. Investigation reveals oom_killer
>> > came along and nuked almost all the ceph-osd processes.
>> >
>> > We've gone through a bunch of iterations of restarting the OSDs, trying
>> > to
>> > bring them up one at a time gradually, all at once, various
>> > configuration
>> > settings to reduce cache size as suggested in this ticket:
>> > http://tracker.ceph.com/issues/18924...
>> >
>> > I don't know if that ticket really pertains to our situation or not, I
>> > have
>> > no experience with memory allocation debugging. I'd be willing to try if
>> > someone can point me to a guide or walk me through the process.
>> >
>> > I've even tried, just to see if the situation was  transitory, adding
>> > over
>> > 300GiB of swap to both OSD machines. The OSD procs managed to allocate,
>> > in a
>> > matter of 5-10 minutes, more than 300GiB of RAM pressure and became
>> > oom_killer victims once again.
>> >
>> > No software or hardware changes took place around the time this problem
>> > started, and no significant data changes occurred either. We added about
>> > 40GiB of ~1GiB files a week or so before the problem started and that's
>> > the
>> > last time data was written.
>> >
>> > I can only assume we've found another crippling bug of some kind, this
>> > level
>> > of memory usage is entirely unprecedented. What can we do?
>> >
>> > Thanks in advance for any suggestions.
>> > -Aaron
>> >
>> >
>
>
>
>
> --
> Aaron Ten Clay
> https://aarontc.com



-- 
Aaron Ten Clay
https://aarontc.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replication (k=1) in LRC

2017-05-04 Thread Oleg Kolosov
Hi Loic
Commenting out the sanity check did the trick. The code is working as I'd
expected.

Thanks

On Fri, Apr 28, 2017 at 1:48 AM, Loic Dachary  wrote:

>
>
> On 04/27/2017 11:43 PM, Oleg Kolosov wrote:
> > Hi Loic,
> > Of course.
> > I'm implementing a version of Pyramid Code. In Pyramid you remove one of
> the global parities of Reed-Solomon and add one local parity for each local
> group. In my version, I'd like to add local parity to the global parity
> (meaning that for the case the global parity = 1, it would be replicated).
> This way in case of a failure in the global parity, you can reconstruct it
> using the replicated node instead of reconstructing it with all K nodes.
> >
> > This is my profile:
> > ceph osd erasure-code-profile set myprofile \
> > plugin=lrc \
> > mapping=DD_DD___ \
> > layers='[
> > [ "DD_DD_c_", "" ],
> > [ "DDc_", "" ],
> > [ "___DDc__", "" ],
> > [ "__Dc", "" ],
> > ]' \
> > ruleset-steps='[
> > [ "chooseleaf", "osd",  8  ],
> > ]'
>
> You could test and see if commenting out the sanity check at
>
> https://github.com/ceph/ceph/blob/master/src/erasure-code/
> jerasure/ErasureCodeJerasure.cc#L89
>
> does the trick. I don't remember enough about this border case to be sure
> it won't work. You can also give it a try with
>
> https://github.com/ceph/ceph/blob/master/src/test/erasure-
> code/ceph_erasure_code_benchmark.cc
>
> Cheers
>
> > Regards,
> > Oleg
> >
> > On Fri, Apr 28, 2017 at 12:33 AM, Loic Dachary  > wrote:
> >
> > Hi Oleg,
> >
> > On 04/27/2017 11:23 PM, Oleg Kolosov wrote:
> > > Hi,
> > > I'm working on various implementation of LRC codes for study
> purposes. The layers implementation in the LRC module is very convenient
> for this, but I've came upon a problem in one of the cases.
> > > I'm interested in having k=1, m=1 in one of the layers. However
> this gives out an error:
> > > Error EINVAL: k=1 must be >= 2
> > >
> > > I should point out that my erasure code has additional layers
> which are fine, only this one has k=1, m=1.
> > >
> > > What was the reason for this issue?
> > > Can replication be implemented in one of LRC's layers?
> >
> > Could you provide the code for me to reproduce this problem ? Or a
> description of the layers ? I implemented this restriction because it made
> the code simpler. And also because I could not think of a valid use case.
> >
> > Cheers
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> >
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitor issues

2017-05-04 Thread Curt Beason
Hello,

So at some point during the night, our monitor 1 server rebooted for an as-yet
unknown reason.  When it came back up, the clock was skewed by 6 hours.
There were no writes happening when I got alerted to the issue.  ceph shows
all OSDs up and in, but no op/s and 600+ blocked requests.  I logged into
mon1, fixed the clock and restarted it.  Ceph status showed all mons up,
no skew, but still no op/s.

Checking the OSD logs, I see cephx auth errors, which (per the Ceph website)
can be caused by clock skew.  So I tried restarting one OSD to check, and saw
the same thing.  So I stopped mon1, figuring the cluster would roll over to
mon2/3 and get us back up and running.

Well, the OSDs weren't showing as up, so I checked my ceph.conf file to see
why they weren't failing over to mon2/3 and noticed it only had the IP for
mon1, so I updated ceph.conf with the IPs for mon2/3 and restarted; the OSDs
came back up and started talking again.

So right now, mon1 is offline, and I only have mon2/3 running.  Without
knowing why mon1 was having issues, I don't want to start it and bring it
back in, just to have the cluster freak out.  At the same time, I'd like to
get back to having a full quorum.  I'm still reviewing the logs on mon1 to
try to see if there are any errors that might point me to the issue.

In the meantime, my questions are: Do you think it would be worth trying to
start mon1 again and seeing what happens?  If it still has issues, will my
OSDs fail over to mon2/3 now that the conf is correct?  Are there any other
issues that might arise from bringing it back in?
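
For reference, the quorum state can be watched with something like the
following while testing (the mon id is a placeholder):

  ceph mon stat
  ceph quorum_status --format json-pretty
  ceph daemon mon.mon1 mon_status   # run on the mon host, via its admin socket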

The other option I could think of would be to deploy a new monitor 4 and then
remove monitor 1, but I think this could lead to other issues if I am reading
the docs correctly.

All our PGs are active+clean, so the cluster is in a healthy state.  The
only warnings are from having set noscrub and nodeep-scrub, and from 1 mon
being down.

Any advice would be greatly appreciated.  Sorry for the long-windedness and
the scattered thought process.

Thanks,
Curt
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to calculate the nearfull ratio ?

2017-05-04 Thread Gregory Farnum
On Thu, May 4, 2017 at 5:30 AM Loic Dachary  wrote:

> Hi,
>
> In a cluster where the failure domain is the host and dozens of hosts, the
> 85% default for nearfull ratio is fine. A host failing won't suddenly make
> the cluster 99% full. In smaller clusters, with 10 hosts or less, it is
> likely to not be enough. And in larger clusters 85% may be too much to
> reserve and 90% could be more than enough.
>
> Is there a way to calculate the optimum nearfull ratio for a given
> crushmap ?


Is failure recovery the primary concern with the nearfull flag? Best I can
recall it was initially more about the CRUSH placement imbalance and
preventing any one OSD from going actually full and halting the whole
cluster.
-Greg



>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Brian Andrus
Hi Stefan - we simply disabled exclusive-lock on all older (pre-jewel)
images. We still allow the default jewel feature sets for newly created
images because, as you mention, the issue does not seem to affect them.
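
Concretely, something along these lines per image (the image spec is a
placeholder; object-map and fast-diff depend on exclusive-lock, so they get
disabled along with it):

  rbd feature disable <pool>/<image> fast-diff object-map exclusive-lock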

On Thu, May 4, 2017 at 10:19 AM, Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> Hello Brian,
>
> this really sounds the same. I don't see this on a cluster with only
> images created AFTER jewel. And it seems to start happening after i
> enabled exclusive lock on all images.
>
> Did just use feature disable, exclusive-lock,fast-diff,object-map or did
> you also restart all those vms?
>
> Greets,
> Stefan
>
> Am 04.05.2017 um 19:11 schrieb Brian Andrus:
> > Sounds familiar... and discussed in "disk timeouts in libvirt/qemu
> VMs..."
> >
> > We have not had this issue since reverting exclusive-lock, but it was
> > suggested this was not the issue. So far it's held up for us with not a
> > single corrupt filesystem since then.
> >
> > On some images (ones created post-Jewel upgrade) the feature could not
> > be disabled, but these don't seem to be affected. Of course, we never
> > did pinpoint the cause of timeouts, so it's entirely possible something
> > else was causing it but no other major changes went into effect.
> >
> > One thing to look for that might confirm the same issue are timeouts in
> > the guest VM. Most OS kernel will report a hung task in conjunction with
> > the hang up/lock/corruption. Wondering if you're seeing that too.
> >
> > On Wed, May 3, 2017 at 10:49 PM, Stefan Priebe - Profihost AG
> > > wrote:
> >
> > Hello,
> >
> > since we've upgraded from hammer to jewel 10.2.7 and enabled
> > exclusive-lock,object-map,fast-diff we've problems with corrupting
> VM
> > filesystems.
> >
> > Sometimes the VMs are just crashing with FS errors and a restart can
> > solve the problem. Sometimes the whole VM is not even bootable and we
> > need to import a backup.
> >
> > All of them have the same problem that you can't revert to an older
> > snapshot. The rbd command just hangs at 99% forever.
> >
> > Is this a known issue - anything we can check?
> >
> > Greets,
> > Stefan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> >
> >
> >
> >
> > --
> > Brian Andrus | Cloud Systems Engineer | DreamHost
> > brian.and...@dreamhost.com | www.dreamhost.com  >
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
And yes, I also see hung tasks in those VMs until they crash.

Stefan

Am 04.05.2017 um 19:11 schrieb Brian Andrus:
> Sounds familiar... and discussed in "disk timeouts in libvirt/qemu VMs..."
> 
> We have not had this issue since reverting exclusive-lock, but it was
> suggested this was not the issue. So far it's held up for us with not a
> single corrupt filesystem since then.
> 
> On some images (ones created post-Jewel upgrade) the feature could not
> be disabled, but these don't seem to be affected. Of course, we never
> did pinpoint the cause of timeouts, so it's entirely possible something
> else was causing it but no other major changes went into effect. 
> 
> One thing to look for that might confirm the same issue are timeouts in
> the guest VM. Most OS kernel will report a hung task in conjunction with
> the hang up/lock/corruption. Wondering if you're seeing that too.
> 
> On Wed, May 3, 2017 at 10:49 PM, Stefan Priebe - Profihost AG
> > wrote:
> 
> Hello,
> 
> since we've upgraded from hammer to jewel 10.2.7 and enabled
> exclusive-lock,object-map,fast-diff we've problems with corrupting VM
> filesystems.
> 
> Sometimes the VMs are just crashing with FS errors and a restart can
> solve the problem. Sometimes the whole VM is not even bootable and we
> need to import a backup.
> 
> All of them have the same problem that you can't revert to an older
> snapshot. The rbd command just hangs at 99% forever.
> 
> Is this a known issue - anything we can check?
> 
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> 
> -- 
> Brian Andrus | Cloud Systems Engineer | DreamHost
> brian.and...@dreamhost.com | www.dreamhost.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
Hello Brian,

this really sounds like the same thing. I don't see this on a cluster with
only images created AFTER jewel. And it seems to have started happening after
I enabled exclusive-lock on all images.

Did you just use 'rbd feature disable' with exclusive-lock,fast-diff,object-map,
or did you also restart all those VMs?

Greets,
Stefan

Am 04.05.2017 um 19:11 schrieb Brian Andrus:
> Sounds familiar... and discussed in "disk timeouts in libvirt/qemu VMs..."
> 
> We have not had this issue since reverting exclusive-lock, but it was
> suggested this was not the issue. So far it's held up for us with not a
> single corrupt filesystem since then.
> 
> On some images (ones created post-Jewel upgrade) the feature could not
> be disabled, but these don't seem to be affected. Of course, we never
> did pinpoint the cause of timeouts, so it's entirely possible something
> else was causing it but no other major changes went into effect. 
> 
> One thing to look for that might confirm the same issue are timeouts in
> the guest VM. Most OS kernel will report a hung task in conjunction with
> the hang up/lock/corruption. Wondering if you're seeing that too.
> 
> On Wed, May 3, 2017 at 10:49 PM, Stefan Priebe - Profihost AG
> > wrote:
> 
> Hello,
> 
> since we've upgraded from hammer to jewel 10.2.7 and enabled
> exclusive-lock,object-map,fast-diff we've problems with corrupting VM
> filesystems.
> 
> Sometimes the VMs are just crashing with FS errors and a restart can
> solve the problem. Sometimes the whole VM is not even bootable and we
> need to import a backup.
> 
> All of them have the same problem that you can't revert to an older
> snapshot. The rbd command just hangs at 99% forever.
> 
> Is this a known issue - anything we can check?
> 
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> 
> -- 
> Brian Andrus | Cloud Systems Engineer | DreamHost
> brian.and...@dreamhost.com | www.dreamhost.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Brian Andrus
Sounds familiar... and discussed in "disk timeouts in libvirt/qemu VMs..."

We have not had this issue since reverting exclusive-lock, but it was
suggested this was not the issue. So far it's held up for us with not a
single corrupt filesystem since then.

On some images (ones created post-Jewel upgrade) the feature could not be
disabled, but these don't seem to be affected. Of course, we never did
pinpoint the cause of timeouts, so it's entirely possible something else
was causing it but no other major changes went into effect.

One thing to look for that might confirm the same issue is timeouts in the
guest VM. Most OS kernels will report a hung task in conjunction with the
hang/lock-up/corruption. Wondering if you're seeing that too.

On Wed, May 3, 2017 at 10:49 PM, Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> Hello,
>
> since we've upgraded from hammer to jewel 10.2.7 and enabled
> exclusive-lock,object-map,fast-diff we've problems with corrupting VM
> filesystems.
>
> Sometimes the VMs are just crashing with FS errors and a restart can
> solve the problem. Sometimes the whole VM is not even bootable and we
> need to import a backup.
>
> All of them have the same problem that you can't revert to an older
> snapshot. The rbd command just hangs at 99% forever.
>
> Is this a known issue - anything we can check?
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reg: PG

2017-05-04 Thread David Turner
If you delete and recreate the pools you will indeed lose data.  Your
cephfs_metadata pool will have almost no data in it.  I have a 9TB
cephfs_data pool and 40MB in the cephfs_metadata pool.  It shouldn't have
anywhere near 128 PGs in it based on a cluster this size.  When you
increase your cluster size you will want to keep track of how many PGs you
have per OSD to maintain your desired ratio.

Like I mentioned earlier, this warning isn't a critical issue as long as
you have enough memory to handle this many PGs per OSD daemon.  Assume at
least 3x memory usage during recovery over what it uses while it's
healthy.  As long as you have the system resources to handle this, you can
increase the warning threshold so your cluster is health_ok again.
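
Something along these lines, for example (the exact threshold is up to you):

  # at runtime:
  ceph tell mon.* injectargs '--mon_pg_warn_max_per_osd 400'

  # and persisted in ceph.conf on the mons:
  #   [mon]
  #   mon pg warn max per osd = 400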

On Thu, May 4, 2017 at 11:33 AM psuresh  wrote:

> Hi David,
>
> Thanks for your explanation.   I have ran following command to create pg
> pool.
>
> ceph osd pool create cephfs_data 128
> ceph osd pool create cephfs_metadata 128
> ceph fs new dev-ceph-setup cephfs_metadata cephfs_data
>
> Is it a proper way for 3 osd?
>
> Does delete and recreate pg pools will have data loss on ceph cluster?
>
> In future if i increase osd count do i need to change pg pool size?
>
> Regards,
> Suresh
>
>  On Thu, 04 May 2017 20:11:45 +0530 *David Turner
> >* wrote 
>
> I'm guessing you have more than just the 1 pool with 128 PGs in your
> cluster (seeing as you have 320 PGs total, I would guess 2 pools with 128
> PGs and 1 pool with 64 PGs).  The combined total number of PGs for all of
> your pools is 320 and with only 3 OSDs and most likely replica size 3...
> that leaves you with too many (320) PGs per OSD.  This will not likely
> affect your testing, but if you want to fix the problem you will need to
> delete and recreate your pools with a combined lower total number of PGs.
>
> The number of PGs is supposed to reflect how much data each pool is going
> to have.  If you have 1 pool that will have 75% of your cluster's data,
> another pool with 20%, and a third pool with 5%... then the number of PGs
> they have should reflect that.  Based on trying to have somewhere between
> 100-200 PGs per osd, and the above estimation for data distribution, you
> should have 128 PGs in the first pool, 32 PGs in the second, and 8 PGs in
> the third.  Each OSD would have 168 PGs and each PG will be roughly the
> same size between each pool.  If you were to add more OSDs, then you would
> need to increase those numbers to account for the additional OSDs to
> maintain the same distribution.  The above math is only for 3 OSDs.  If you
> had 6 OSDs, then the goal would be to have somewhere between 200-400 PGs
> total to maintain the same 100-200 PGs per OSD.
>
> On Thu, May 4, 2017 at 10:24 AM psuresh  wrote:
>
>
> Hi,
>
> I'm running 3 osd in my test setup.   I have created PG pool with 128 as
> per the ceph documentation.
> But i'm getting too many PGs warning.   Can anyone clarify? why i'm
> getting this warning.
>
> Each OSD contain 240GB disk.
>
> cluster 9d325da2-3d87-4b6b-8cca-e52a4b65aa08
>  health HEALTH_WARN
>* too many PGs per OSD (320 > max 300)*
>  monmap e2: 3 mons at
> {dev-ceph-mon1:6789/0,dev-ceph-mon2:6789/0,dev-ceph-mon3:6789/0}
> election epoch 6, quorum 0,1,2
> dev-ceph-mon1,dev-ceph-mon2,dev-ceph-mon3
>   fsmap e40: 1/1/1 up {0=dev-ceph-mds-active=up:active}
>  osdmap e356: 3 osds: 3 up, 3 in
> flags sortbitwise,require_jewel_osds
>   pgmap v32407: 320 pgs, 3 pools, 27456 MB data, 220 kobjects
> 100843 MB used, 735 GB / 833 GB avail
>  320 active+clean
>
> Regards,
> Suresh
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to calculate the nearfull ratio ?

2017-05-04 Thread Loic Dachary


On 05/04/2017 03:58 PM, Xavier Villaneau wrote:
> Hello Loïc,
> 
> On Thu, May 4, 2017 at 8:30 AM Loic Dachary  > wrote:
> 
> Is there a way to calculate the optimum nearfull ratio for a given 
> crushmap ?
> 
> 
> This is a question that I was planning to cover in those calculations I was 
> working on for python-crush. I've currently shelved the work for a few weeks 
> but intend to look at it again as time frees up.

Of course ! Now I see how the two are related.

Thanks.

> Basically, I see this as a five-fold uncertainty problem:
> 1. CRUSH mappings are pseudo-random and therefore (usually) uneven
> 2. Object distribution between placement groups has the exact same issue
> 3. Object size within a given pool can also vary greatly (from bytes to 
> megabytes)
> 4. Failures and the following re-balancing are also random.
> 5. Finally, pools can occupy different and overlapping sets of OSDs, and hold 
> independent sets of objects.
> 
> Thanks to your new CRUSH tools, I think #1 and #4 are solved respectively by 
> the ability to:
> - generate a CRUSH map for a precise (and even) distribution of PGs;
> - test mappings for every scenario of N failures and find the worst-case 
> scenario (very expensive calculation, but possible).
> 
> Issues #2 and #3 are more tricky. The big picture is that a given amount of 
> data is placed more evenly the more objects there are, and there should be a 
> way to use statistics to quantify that. Variance in object size then brings 
> in more uncertainty, but I think that metric is difficult to quantify outside 
> of very specific use cases where object size are known.
> 
> Finally, this might all be made redundant by the new auto-rebalancing feature 
> that Sage is planning for Luminous. If we can assume even data placement at 
> all times the #4 is the only thing we need to worry about. For 
> performance-based placement that would be very different however. And if 
> pools have overlapping OSD sets, that could be fairly tricky too.
> 
> Maybe some other users here already have some rule of thumb or actual 
> calculations for that. I was planning to get into the statistical 
> calculations of data placement assuming unique object size as the next step 
> for the paper I am working on. Would there be a need for such tools?
> 
> Regards,
> -- 
> Xavier Villaneau
> Storage Software Eng. at Concurrent Computer Corp.
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reg: PG

2017-05-04 Thread psuresh
 Hi David,



Thanks for your explanation.  I ran the following commands to create the pools:



ceph osd pool create cephfs_data 128

ceph osd pool create cephfs_metadata 128

ceph fs new dev-ceph-setup cephfs_metadata cephfs_data



Is this a proper setup for 3 OSDs?



Will deleting and recreating the pools cause data loss on the Ceph cluster?



In the future, if I increase the OSD count, do I need to change the pools' PG counts?



Regards,

Suresh 



 On Thu, 04 May 2017 20:11:45 +0530 David Turner 
drakonst...@gmail.com wrote 




I'm guessing you have more than just the 1 pool with 128 PGs in your cluster 
(seeing as you have 320 PGs total, I would guess 2 pools with 128 PGs and 1 
pool with 64 PGs).  The combined total number of PGs for all of your pools is 
320 and with only 3 OSDs and most likely replica size 3... that leaves you with 
too many (320) PGs per OSD.  This will not likely affect your testing, but if 
you want to fix the problem you will need to delete and recreate your pools 
with a combined lower total number of PGs.



The number of PGs is supposed to reflect how much data each pool is going to 
have.  If you have 1 pool that will have 75% of your cluster's data, another 
pool with 20%, and a third pool with 5%... then the number of PGs they have 
should reflect that.  Based on trying to have somewhere between 100-200 PGs per 
osd, and the above estimation for data distribution, you should have 128 PGs in 
the first pool, 32 PGs in the second, and 8 PGs in the third.  Each OSD would 
have 168 PGs and each PG will be roughly the same size between each pool.  If 
you were to add more OSDs, then you would need to increase those numbers to 
account for the additional OSDs to maintain the same distribution.  The above 
math is only for 3 OSDs.  If you had 6 OSDs, then the goal would be to have 
somewhere between 200-400 PGs total to maintain the same 100-200 PGs per OSD.




On Thu, May 4, 2017 at 10:24 AM psuresh psur...@zohocorp.com wrote:







Hi,



I'm running 3 osd in my test setup.   I have created PG pool with 128 as per 
the ceph documentation.   

But i'm getting too many PGs warning.   Can anyone clarify? why i'm getting 
this warning.   



Each OSD contain 240GB disk. 



cluster 9d325da2-3d87-4b6b-8cca-e52a4b65aa08

 health HEALTH_WARN

too many PGs per OSD (320 > max 300)

 monmap e2: 3 mons at 
{dev-ceph-mon1:6789/0,dev-ceph-mon2:6789/0,dev-ceph-mon3:6789/0}

election epoch 6, quorum 0,1,2 
dev-ceph-mon1,dev-ceph-mon2,dev-ceph-mon3

  fsmap e40: 1/1/1 up {0=dev-ceph-mds-active=up:active}

 osdmap e356: 3 osds: 3 up, 3 in

flags sortbitwise,require_jewel_osds

  pgmap v32407: 320 pgs, 3 pools, 27456 MB data, 220 kobjects

100843 MB used, 735 GB / 833 GB avail

 320 active+clean



Regards,

Suresh



___

 ceph-users mailing list

 ceph-users@lists.ceph.com

 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Checking the current full and nearfull ratio

2017-05-04 Thread Sage Weil
On Thu, 4 May 2017, Adam Carheden wrote:
> How do I check the full ratio and nearfull ratio of a running cluster?
> 
> I know i can set 'mon osd full ratio' and 'mon osd nearfull ratio' in
> the [global] setting of ceph.conf. But things work fine without those
> lines (uses defaults, obviously).
> 
> They can also be changed with `ceph tell mon.* injectargs
> "--mon_osd_full_ratio .##` and `ceph tell mon.* injectargs
> "--mon_osd_nearfull_ratio .##`, in which case the running cluster's
> notion of full/nearfull wouldn't match ceph.conf.

Sort of... those configs set the initial values, but the ones that are 
applied are actually in PGMap.  Look at 'ceph pg dump | head' and adjust 
the values with 'ceph pg set_full_ratio' and 'ceph pg set_nearfull_ratio'.

Note that this is improved and cleaned up in luminous: the commands switch 
to 'ceph osd set-[near]full-ratio' and the values move into the OSDMap, 
along with the other full configurables (the failsafe ratio, and the ratio 
at which backfill is stopped).
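
i.e. something like:

  ceph pg dump | head            # full_ratio / nearfull_ratio appear in the header
  ceph pg set_nearfull_ratio 0.90
  ceph pg set_full_ratio 0.95

  # luminous and later:
  #   ceph osd set-nearfull-ratio 0.90
  #   ceph osd set-full-ratio 0.95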
 
> How do I have monitors report the values they're currently running with?
> (i.e. is there something like `ceph tell mon.* dumpargs...`?)
> 
> It seems like this should be a pretty basic question, but my Googlefoo
> is failing me this morning.
> 
> For those who find this post and want to check how full their OSDs are
> rather than checking the full/nearfull limits, `ceph osd df tree` seems
> to be the hot ticket.
> 
> 
> And as long as I'm posting, I may as well get my next question out of
> the way. My minimally used 4-node, 16 OSD test cluster looks like this:
> # ceph osd df tree
> 
> MIN/MAX VAR: 0.75/1.31  STDDEV: 0.84
> 
> When should one be concerned about imbalance? What values for
> min/max/stddev represent problems where reweighting an OSD (or other
> action) is advisable? Is that the purpose of nearfull, or
> does one need to monitor individual OSDs too?

You can use 'osd reweight-by-utilization' to reduce the variance.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Checking the current full and nearfull ratio

2017-05-04 Thread Adam Carheden
How do I check the full ratio and nearfull ratio of a running cluster?

I know i can set 'mon osd full ratio' and 'mon osd nearfull ratio' in
the [global] setting of ceph.conf. But things work fine without those
lines (uses defaults, obviously).

They can also be changed with `ceph tell mon.* injectargs
"--mon_osd_full_ratio .##` and `ceph tell mon.* injectargs
"--mon_osd_nearfull_ratio .##`, in which case the running cluster's
notion of full/nearfull wouldn't match ceph.conf.

How do I have monitors report the values they're currently running with?
(i.e. is there something like `ceph tell mon.* dumpargs...`?)

It seems like this should be a pretty basic question, but my Googlefoo
is failing me this morning.

For those who find this post and want to check how full their OSDs are
rather than checking the full/nearfull limits, `ceph osd df tree` seems
to be the hot ticket.


And as long as I'm posting, I may as well get my next question out of
the way. My minimally used 4-node, 16 OSD test cluster looks like this:
# ceph osd df tree

MIN/MAX VAR: 0.75/1.31  STDDEV: 0.84

When should one be concerned about imbalance? What values for
min/max/stddev represent problems where reweighting an OSD (or other
action) is advisable? Is that the purpose of nearfull, or
does one need to monitor individual OSDs too?


-- 
Adam Carheden

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reg: PG

2017-05-04 Thread Richard Hesketh
The extra pools are probably the data and metadata pools that are
automatically created for cephfs.

http://ceph.com/pgcalc/ is a useful tool for helping to work out how
many PGs your pools should have.

Rich

On 04/05/17 15:41, David Turner wrote:
> I'm guessing you have more than just the 1 pool with 128 PGs in your
> cluster (seeing as you have 320 PGs total, I would guess 2 pools with
> 128 PGs and 1 pool with 64 PGs).  The combined total number of PGs for
> all of your pools is 320 and with only 3 OSDs and most likely replica
> size 3... that leaves you with too many (320) PGs per OSD.  This will
> not likely affect your testing, but if you want to fix the problem you
> will need to delete and recreate your pools with a combined lower
> total number of PGs.
>
> The number of PGs is supposed to reflect how much data each pool is
> going to have.  If you have 1 pool that will have 75% of your
> cluster's data, another pool with 20%, and a third pool with 5%...
> then the number of PGs they have should reflect that.  Based on trying
> to have somewhere between 100-200 PGs per osd, and the above
> estimation for data distribution, you should have 128 PGs in the first
> pool, 32 PGs in the second, and 8 PGs in the third.  Each OSD would
> have 168 PGs and each PG will be roughly the same size between each
> pool.  If you were to add more OSDs, then you would need to increase
> those numbers to account for the additional OSDs to maintain the same
> distribution.  The above math is only for 3 OSDs.  If you had 6 OSDs,
> then the goal would be to have somewhere between 200-400 PGs total to
> maintain the same 100-200 PGs per OSD.
>
> On Thu, May 4, 2017 at 10:24 AM psuresh  > wrote:
>
> Hi,
>
> I'm running 3 osd in my test setup.   I have created PG pool with
> 128 as per the ceph documentation.  
> But i'm getting too many PGs warning.   Can anyone clarify? why
> i'm getting this warning.  
>
> Each OSD contain 240GB disk.
>
> cluster 9d325da2-3d87-4b6b-8cca-e52a4b65aa08
>  health HEALTH_WARN
>*too many PGs per OSD (320 > max 300)*
>  monmap e2: 3 mons at
> {dev-ceph-mon1:6789/0,dev-ceph-mon2:6789/0,dev-ceph-mon3:6789/0}
> election epoch 6, quorum 0,1,2
> dev-ceph-mon1,dev-ceph-mon2,dev-ceph-mon3
>   fsmap e40: 1/1/1 up {0=dev-ceph-mds-active=up:active}
>  osdmap e356: 3 osds: 3 up, 3 in
> flags sortbitwise,require_jewel_osds
>   pgmap v32407: 320 pgs, 3 pools, 27456 MB data, 220 kobjects
> 100843 MB used, 735 GB / 833 GB avail
>  320 active+clean
>
> Regards,
> Suresh
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reg: PG

2017-05-04 Thread David Turner
I'm guessing you have more than just the 1 pool with 128 PGs in your
cluster (seeing as you have 320 PGs total, I would guess 2 pools with 128
PGs and 1 pool with 64 PGs).  The combined total number of PGs for all of
your pools is 320 and with only 3 OSDs and most likely replica size 3...
that leaves you with too many (320) PGs per OSD.  This will not likely
affect your testing, but if you want to fix the problem you will need to
delete and recreate your pools with a combined lower total number of PGs.

The number of PGs is supposed to reflect how much data each pool is going
to have.  If you have 1 pool that will have 75% of your cluster's data,
another pool with 20%, and a third pool with 5%... then the number of PGs
they have should reflect that.  Based on trying to have somewhere between
100-200 PGs per osd, and the above estimation for data distribution, you
should have 128 PGs in the first pool, 32 PGs in the second, and 8 PGs in
the third.  Each OSD would have 168 PGs and each PG will be roughly the
same size between each pool.  If you were to add more OSDs, then you would
need to increase those numbers to account for the additional OSDs to
maintain the same distribution.  The above math is only for 3 OSDs.  If you
had 6 OSDs, then the goal would be to have somewhere between 200-400 PGs
total to maintain the same 100-200 PGs per OSD.
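
As a rough sanity check, PGs per OSD is just the sum over pools of
pg_num * replica size, divided by the number of OSDs; e.g. for the numbers
guessed above:

  # 2 pools at 128 PGs + 1 pool at 64 PGs, replica size 3, spread over 3 OSDs
  echo $(( (128*3 + 128*3 + 64*3) / 3 ))   # -> 320 PGs per OSD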

On Thu, May 4, 2017 at 10:24 AM psuresh  wrote:

> Hi,
>
> I'm running 3 osd in my test setup.   I have created PG pool with 128 as
> per the ceph documentation.
> But i'm getting too many PGs warning.   Can anyone clarify? why i'm
> getting this warning.
>
> Each OSD contain 240GB disk.
>
> cluster 9d325da2-3d87-4b6b-8cca-e52a4b65aa08
>  health HEALTH_WARN
>* too many PGs per OSD (320 > max 300)*
>  monmap e2: 3 mons at
> {dev-ceph-mon1:6789/0,dev-ceph-mon2:6789/0,dev-ceph-mon3:6789/0}
> election epoch 6, quorum 0,1,2
> dev-ceph-mon1,dev-ceph-mon2,dev-ceph-mon3
>   fsmap e40: 1/1/1 up {0=dev-ceph-mds-active=up:active}
>  osdmap e356: 3 osds: 3 up, 3 in
> flags sortbitwise,require_jewel_osds
>   pgmap v32407: 320 pgs, 3 pools, 27456 MB data, 220 kobjects
> 100843 MB used, 735 GB / 833 GB avail
>  320 active+clean
>
> Regards,
> Suresh
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance

2017-05-04 Thread Peter Maloney
On 05/04/17 13:37, Fuxion Cloud wrote:
> Hi,
>
> Our ceph version is 0.80.7. We used it with the openstack as a block
> storage RBD. The ceph storage configured with 3 replication of data.
> I'm getting low IOPS (400)  from fio benchmark in random readwrite.
> Please advise how to improve it. Thanks.
I'll let others comment on whether 0.80.7 is too old and whether you should
obviously upgrade... I don't think anyone should be using anything older
than hammer, which is the previous, nearly-EoL LTS version.
>
> Here's the hardware info.
> 12 x storage nodes
> - 2 x cpus (12 cores)
> - 64 GB RAM
> - 10 x 4TB SAS 7.2krpm OSD
> - 2 x 200GB SSD Journal
> - 2 x 200GB SSD OS
5 osds per journal sounds like too many.

Which model are the SSDs?
How large are the journals?

When you run your fio test, what is the command you run? which type of
storage? rbd fuse? krbd? fio --engine=rbd?
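
e.g. with the rbd engine, something roughly like this (assuming fio was built
with rbd support; pool/image names are placeholders):

  fio --name=randrw-test --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=fio-test --rw=randrw --bs=4k --iodepth=32 --numjobs=1 \
      --direct=1 --runtime=60 --time_based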

When you run fio, what does iostat show you? Would you say the HDDs are
the bottleneck, or the SSDs?

iostat -xm 1 /dev/sd[a-z]

> - 2 x 10Gb (bond - ceph network)  
> - 2 x 10Gb (bond - openstack network)
What kind of link do you have between racks?

What is the failure domain? rack or host?

What is the size (replication size) of the pool you are testing?

>
> Ceph status:
>  
>  health HEALTH_OK
>  monmap e1: 3 mons at
> {node1=10.10.10.11:6789/0,node2=10.10.10.12:6789/0,node7=10.10.10.17:6789/0
> },
> election epoch 1030, quorum 0,1,2 node1,node2,node7
>  osdmap e116285: 120 osds: 120 up, 120 in
>   pgmap v70119491: 14384 pgs, 5 pools, 5384 GB data, 841 kobjects
> 16774 GB used, 397 TB / 413 TB avail
>14384 active+clean
>   client io 11456 kB/s rd, 13389 kB/s wr, 420 op/s
>
> Ceph osd tree:
> # id   weight   type name          up/down   reweight
> -1     414      root default
> -14    207          rack rack1
> -3     34.5             host node1
> 1      3.45                 osd.1      up    1
> 4      3.45                 osd.4      up    1
> 7      3.45                 osd.7      up    1
> 10     3.45                 osd.10     up    1
> 13     3.45                 osd.13     up    1
> 16     3.45                 osd.16     up    1
> 19     3.45                 osd.19     up    1
> 22     3.45                 osd.22     up    1
> 25     3.45                 osd.25     up    1
> 28     3.45                 osd.28     up    1
> -4     34.5             host node2
> 5      3.45                 osd.5      up    1
> 11     3.45                 osd.11     up    1
> 14     3.45                 osd.14     up    1
> 17     3.45                 osd.17     up    1
> 20     3.45                 osd.20     up    1
> 23     3.45                 osd.23     up    1
> 26     3.45                 osd.26     up    1
> 29     3.45                 osd.29     up    1
> 38     3.45                 osd.38     up    1
> 2      3.45                 osd.2      up    1
> -5     34.5             host node3
> 31     3.45                 osd.31     up    1
> 48     3.45                 osd.48     up    1
> 57     3.45                 osd.57     up    1
> 66     3.45                 osd.66     up    1
> 75     3.45                 osd.75     up    1
> 84     3.45                 osd.84     up    1
> 93     3.45                 osd.93     up    1
> 102    3.45                 osd.102    up    1
> 111    3.45                 osd.111    up    1
> 39     3.45                 osd.39     up    1
> -7     34.5             host node4
> 35     3.45                 osd.35     up    1
> 46     3.45                 osd.46     up    1
> 55     3.45                 osd.55     up    1
> 64     3.45                 osd.64     up    1
> 72     3.45                 osd.72     up    1
> 81     3.45                 osd.81     up    1
> 90     3.45                 osd.90     up    1
> 98     3.45                 osd.98     up    1
> 107    3.45                 osd.107    up    1
> 116    3.45                 osd.116    up    1
> -10    34.5             host node5
> 43     3.45                 osd.43     up    1
> 54     3.45                 osd.54     up    1
> 60     3.45                 osd.60     up    1
> 67     3.45                 osd.67     up    1
> 78     3.45                 osd.78     up    1
> 87     3.45                 osd.87     up    1
> 96     3.45                 osd.96     up    1
> 104    3.45                 osd.104    up    1
> 113    3.45                 osd.113    up    1
> 8      3.45                 osd.8      up    1
> -13    34.5             host node6
> 32     3.45                 osd.32     up    1
> 47     3.45                 osd.47     up    1
> 56     3.45                 osd.56     up    1
> 65     3.45                 osd.65     up    1
> 74     3.45                 osd.74     up    1
> 83     3.45                 osd.83     up    1
> 92     3.45                 osd.92     up    1
> 110    3.45                 osd.110    up    1
> 119    3.45                 osd.119    up    1
> 101    3.45                 osd.101    up    1
> -15    207          rack rack2
> -2     34.5             host node7
> 0      3.45                 osd.0      up    1
> 3      3.45                 osd.3      up    1
> 6      3.45                 osd.6      up    1
> 9      3.45                 osd.9      up    1
> 12     3.45                 osd.12     up    1
> 15     3.45                 osd.15     up    1
> 18     3.45                 osd.18     up    1
> 21     3.45                 osd.21     up    1
> 24     3.45                 osd.24     up    1
> 27     3.45                 osd.27     up    1
> -6     34.5             host node8
> 30     3.45                 osd.30     up    1
> 40     3.45                 osd.40     up    1
> 49     3.45                 osd.49     up    1
> 58     3.45                 osd.58     up    1
> 68     3.45                 osd.68     up    1
> 77     3.45                 osd.77     up    1
> 86     3.45                 osd.86     up    1
> 95     3.45                 osd.95     up    1
> 105    3.45                 osd.105    up    1
> 114    3.45                 osd.114    up    1
> -8     34.5             host node9
> 33     3.45                 osd.33     up    1
> 45     3.45                 osd.45     up    1
> 52     3.45                 osd.52     up    1
> 59     3.45                 osd.59     up    1
> 73     3.45                 osd.73     up    1
> 82     3.45                 osd.82     up    1
> 91     3.45                 osd.91     up    1
> 100    3.45                 osd.100    up    1
> 108    3.45                 osd.108    up    1
> 117    3.45                 osd.117    up    1
> -9     34.5             host node10
> 36     3.45                 osd.36     up    1
> 42     3.45                 osd.42     up    1
> 51     3.45                 osd.51     up    1
> 61     3.45                 osd.61     up    1
> 69     3.45                 osd.69     up    1
> 76     3.45                 osd.76     up    1
> 85     3.45                 osd.85     up    1
> 94     3.45                 osd.94     up    1
> 103    3.45                 osd.103    up    1
> 112    3.45                 osd.112    up    1
> -11    34.5             host node11
> 50     3.45                 osd.50     up    1
> 63     3.45                 osd.63     up    1
> 71     3.45                 osd.71     up    1
> 79     3.45                 osd.79     up    1
> 89     3.45                 osd.89     up    1
> 106    3.45                 osd.106    up    1
> 115    3.45                 osd.115    up    1
> 34     3.45                 osd.34     up    1
> 120    3.45                 osd.120    up    1
> 121    3.45                 osd.121    up    1
> -12    34.5             host node12
> 37     3.45                 osd.37     up    1
> 44     3.45                 osd.44     up    1
> 53     3.45                 osd.53     up    1
> 62     3.45                 osd.62     up    1
> 70     3.45                 osd.70     up    1
> 80     3.45                 osd.80     up    1
> 88     3.45                 osd.88     up    1
> 99     3.45                 osd.99     up    1
> 109    3.45                 osd.109    up    1
> 118    3.45                 osd.118    up    1
>
>
> Thanks,
> James
>
> On Thu, May 4, 2017 at 5:06 PM, Christian Wuerdig
> > wrote:
>
>
>
> On Thu, May 4, 2017 at 7:53 PM, Fuxion Cloud
> > wrote:
>
> Hi all, 
>
> Im newbie in ceph technology. We have ceph deployed by vendor
> 2 years ago with Ubuntu 14.04LTS without fine tuned the
> performance. I noticed that the performance of storage is very
> slow. Can someone please help to advise how to  improve the
> performance? 
>
>
> You really need to provide a bit more information than that. 

[ceph-users] Reg: PG

2017-05-04 Thread psuresh
Hi,



I'm running 3 OSDs in my test setup. I have created the pool with 128 PGs as
per the Ceph documentation.

But I'm getting a "too many PGs" warning. Can anyone clarify why I'm getting
this warning?



Each OSD contains a 240GB disk.



cluster 9d325da2-3d87-4b6b-8cca-e52a4b65aa08

 health HEALTH_WARN

too many PGs per OSD (320 > max 300)

 monmap e2: 3 mons at 
{dev-ceph-mon1:6789/0,dev-ceph-mon2:6789/0,dev-ceph-mon3:6789/0}

election epoch 6, quorum 0,1,2 
dev-ceph-mon1,dev-ceph-mon2,dev-ceph-mon3

  fsmap e40: 1/1/1 up {0=dev-ceph-mds-active=up:active}

 osdmap e356: 3 osds: 3 up, 3 in

flags sortbitwise,require_jewel_osds

  pgmap v32407: 320 pgs, 3 pools, 27456 MB data, 220 kobjects

100843 MB used, 735 GB / 833 GB avail

 320 active+clean
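
As far as I can tell the warning is derived per OSD roughly like this
(assuming all three pools use the default replica size of 3):

    PGs per OSD = (total PGs x replica size) / number of OSDs
                = (320 x 3) / 3 = 320

which is over the default mon_pg_warn_max_per_osd of 300. Is that right?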



Regards,

Suresh


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to calculate the nearfull ratio ?

2017-05-04 Thread Xavier Villaneau
Hello Loïc,

On Thu, May 4, 2017 at 8:30 AM Loic Dachary  wrote:

> Is there a way to calculate the optimum nearfull ratio for a given
> crushmap ?
>

This is a question that I was planning to cover in those calculations I was
working on for python-crush. I've currently shelved the work for a few
weeks but intend to look at it again as time frees up.

Basically, I see this as a five-fold uncertainty problem:
1. CRUSH mappings are pseudo-random and therefore (usually) uneven
2. Object distribution between placement groups has the exact same issue
3. Object size within a given pool can also vary greatly (from bytes to
megabytes)
4. Failures and the following re-balancing are also random.
5. Finally, pools can occupy different and overlapping sets of OSDs, and
hold independent sets of objects.

Thanks to your new CRUSH tools, I think #1 and #4 are solved respectively
by the ability to:
- generate a CRUSH map for a precise (and even) distribution of PGs;
- test mappings for every scenario of N failures and find the worst-case
scenario (very expensive calculation, but possible).

Issues #2 and #3 are more tricky. The big picture is that a given amount of
data is placed more evenly the more objects there are, and there should be
a way to use statistics to quantify that. Variance in object size then
brings in more uncertainty, but I think that metric is difficult to
quantify outside of very specific use cases where object sizes are known.
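
As a rough illustration of that statistical argument (assuming purely
pseudo-random, equal-weight placement), the count of items per bucket is
approximately binomial, so:

    relative spread ~= 1 / sqrt(n),  n = average items per bucket
                                     (PGs per OSD for #1, objects per PG for #2)

i.e. roughly +/-10% at one standard deviation for n = 100, and +/-3% for n = 1000.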

Finally, this might all be made redundant by the new auto-rebalancing
feature that Sage is planning for Luminous. If we can assume even data
placement at all times the #4 is the only thing we need to worry about. For
performance-based placement that would be very different however. And if
pools have overlapping OSD sets, that could be fairly tricky too.

Maybe some other users here already have some rule of thumb or actual
calculations for that. I was planning to get into the statistical
calculations of data placement assuming unique object size as the next step
for the paper I am working on. Would there be a need for such tools?

Regards,
-- 
Xavier Villaneau
Storage Software Eng. at Concurrent Computer Corp.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to calculate the nearfull ratio ?

2017-05-04 Thread David Turner
The Ceph Enterprise default is 65% nearfull. Do not go above 85% nearfull
unless you are stuck while backfilling and need to increase it to
add/remove storage. Ceph needs overhead to be able to recover from
situations where disks are lost. I always take into account what would
happen to the %full if I lost a full storage node and used that as my
target size.
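
As a back-of-the-envelope version of that check (assuming data is spread
evenly and the failure domain is the host): losing one of N equally sized
hosts pushes the surviving OSDs from a fill level f to roughly

    f * N / (N - 1)

so with 10 hosts you would keep f below about 0.85 * 9/10 ~= 76% to stay
under an 85% ratio after a full node loss.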

The first place to look when trying to use more space is to balance your
cluster so that all disks sit closer to how full the cluster is as a whole. By
default OSDs can end up far apart (some at 50% while others are at 85%) even
though your cluster % used is right around the upper 60's-low 70's.
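
A quick way to see and reduce that spread (a sketch; the threshold is only
illustrative):

    ceph osd df                              # per-OSD utilisation and variance
    ceph osd reweight-by-utilization 110     # gently down-weight OSDs >10% above average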

The most full I've ever maintained a cluster was 77-80%, and any time a disk
died or we added storage it took constant babysitting to make sure things
would finish backfilling. That cluster has 50 nodes.

On Thu, May 4, 2017, 8:30 AM Loic Dachary  wrote:

> Hi,
>
> In a cluster where the failure domain is the host and dozens of hosts, the
> 85% default for nearfull ratio is fine. A host failing won't suddenly make
> the cluster 99% full. In smaller clusters, with 10 hosts or less, it is
> likely to not be enough. And in larger clusters 85% may be too much to
> reserve and 90% could be more than enough.
>
> Is there a way to calculate the optimum nearfull ratio for a given
> crushmap ?
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
Hi Jason,

> Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
> command and post the resulting log to a new ticket at [1]?

will do so next time. I was able to solve this by restarting all osds.
After that i was able to successfuly delete the image.

> I'd also be interested if you could re-create that
> "librbd::object_map::InvalidateRequest" issue repeatably.
No i can't. It happens randomly for 5-10 images out of 2000.

Greets,
Stefan

Am 04.05.2017 um 14:20 schrieb Jason Dillaman:
> Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
> command and post the resulting log to a new ticket at [1]? I'd also be
> interested if you could re-create that
> "librbd::object_map::InvalidateRequest" issue repeatably.
> 
> [1] http://tracker.ceph.com/projects/rbd/issues
> 
> On Thu, May 4, 2017 at 3:45 AM, Stefan Priebe - Profihost AG
>  wrote:
>> Example:
>> # rbd rm cephstor2/vm-136-disk-1
>> Removing image: 99% complete...
>>
>> Stuck at 99% and never completes. This is an image which got corrupted
>> for an unknown reason.
>>
>> Greets,
>> Stefan
>>
>> Am 04.05.2017 um 08:32 schrieb Stefan Priebe - Profihost AG:
>>> I'm not sure whether this is related but our backup system uses rbd
>>> snapshots and reports sometimes messages like these:
>>> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
>>> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
>>>
>>> Stefan
>>>
>>>
>>> Am 04.05.2017 um 07:49 schrieb Stefan Priebe - Profihost AG:
 Hello,

 since we've upgraded from hammer to jewel 10.2.7 and enabled
 exclusive-lock,object-map,fast-diff we've problems with corrupting VM
 filesystems.

 Sometimes the VMs are just crashing with FS errors and a restart can
 solve the problem. Sometimes the whole VM is not even bootable and we
 need to import a backup.

 All of them have the same problem that you can't revert to an older
 snapshot. The rbd command just hangs at 99% forever.

 Is this a known issue - anythink we can check?

 Greets,
 Stefan

>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph newbie thoughts and questions

2017-05-04 Thread David Turner
For gluster, when files are written into it as a mounted network gluster
filesystem, it writes a lot of metadata for each object so it knows everything
it needs to about it for replication purposes. If you put the data manually on
the brick then it wouldn't be able to sync.

Correct, 3 mons, 2 mds, and 3 osd nodes is a good place to start. You can
choose to use erasure coding with a 2:1 setup (default if you create the
pool with options for erasure coding) or a replica setup with size 3
(default configuration).
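
If you go the erasure-coded route, creating such a pool looks roughly like
this (the profile name, pool name and PG count here are just placeholders):

    ceph osd erasure-code-profile set ec21 k=2 m=1
    ceph osd pool create ecpool 128 128 erasure ec21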

The mds data is stored in the cluster.  I have an erasure coded cephfs that
has 9TB of data in it and the mds service uses 8k on disk (the size of the
folder and the keyring).  This is in my home cluster and I run each node
with 3 osds, a mon, and an mds.  I have replica pools and erasure coded
pools based on which is right for the job.

Failover of the mds works seamlessly for the clients.  The docs recommend
against hyper-converging services because if you do not have enough system
resources, then your daemons can crash/hang due to resource contention.  The
times you will run into resource contention is while your cluster isn't
healthy. Most ceph daemons can use 2-3x more memory while the cluster isn't
healthy as opposed to while it's health_ok.

On Thu, May 4, 2017, 4:17 AM Marcus  wrote:

> Thank you very much for your answer David, just what I was after!
> Just some additional questions to make it clear to me.
> The mds do not need to be in odd numbers?
> They can be set up 1,2,3,4 aso. as needed?
>
> You made the basics clear to me so when I set up my first ceph fs I need
> as a start:
> 3 mons, 2 mds and 3 ods. (To be able to avoid single point of failure)
>
> Is there a clear ratio/relation/approximation between ods and mds?
> If I have, say, 100TB of disk for ods, do I neeed X GB disk for mds?
>
> About gluster, my machines are set up in a gluster cluster today, but the
> reason for thinking about ceph fs for these machines instead is that I have
> problems with replication that I have not been able to solve. Second of all
> is that we get indications from our organisation that data use will expand
> very quickly, and that is where I see that ceph fs will suit us. Easy
> expand as needed.
> Thanks to your description of gluster I will be able to reconfigure my
> gluster cluster and rsync to the mounted cluster. I have used rsync
> directly to the harddrive, and now this is obvious that it does not work
> (worked fine a a single distributed server, but not as a replica). I just
> haven't got this tip from anybody else. Thanks again!
>
> We will start using ceph fs, because this goes hand in hand with our
> future needs.
>
> Best regards
> Marcus
>
>
>
>
> On 04/05/17 06:30, David Turner wrote:
>
> The clients will need to be able to contact the mons and the osds.  NEVER
> use 2 mons.  Mons are a quorum and work best with odd numbers (1, 3, 5,
> etc).  1 mon is better than 2 mons.  It is better to remove the raid and
> put the individual disks as OSDs.  Ceph handles the redundancy through
> replica copies.  It is much better to have a third node for failure domain
> reasons so you can have 3 copies of your data and have 1 in each of the 3
> servers.  The OSDs store their information in broken up objects divvied up
> into PGs that are assigned to the OSDs.  You would need to set up CephFS
> and rsync the data into it to migrate the data into ceph.
>
> I don't usually recommend this, but you might prefer Gluster.  You would
> use the raided disks as the brick in each node.  Set it up to have 2 copies
> (better to have 3 but you only have 2 nodes).  Each server can be used to
> NFS map the gluster mount point.  The files are stored as flat files on the
> bricks, but you would still need to create the gluster first and then rsync
> the data into the mounted gluster instead of directly onto the disk.  With
> this you don't have to worry about the mon service, mds service, osd
> services, balancing the crush map, etc.  Gluster of course has its own
> complexities and limitations, but it might be closer to what you're looking
> for right now.
>
> On Wed, May 3, 2017 at 4:06 PM Marcus Pedersén 
> wrote:
>
>> Hello everybody!
>>
>> I am a newbie on ceph and I really like it and want to try it out.
>> I have a couple of thoughts and questions after reading documentation and
>> need some help to see that I am on the right path.
>>
>> Today I have two file servers in production that I want to start my ceph
>> fs on and expand from that.
>> I want these servers to function as a failover cluster and as I see it I
>> will be able to do it with ceph.
>>
>> To get a failover cluster without a single point of failure I need at
>> least 2 monitors, 2 mds and 2 osd (my existing file servers), right?
>> Today, both of the file servers use a raid on 8 disks. Do I format my
>> raid xfs and run my osds on the raid?
>> Or do I split up my raid and add the disks directly to 

[ceph-users] Rebalancing causing IO Stall/IO Drops to zero

2017-05-04 Thread Osama Hasebou
Hi Everyone, 

We keep running into stalled IOs (they also drop almost to zero) whenever a
node suddenly goes down or there is a large amount of rebalancing going on,
and once rebalancing is completed we also get stalled IO for another 2-10
mins.

Has anyone seen this behaviour before and found a way to fix this? We are 
seeing this on Ceph Hammer and also on Jewel. 
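
Is throttling recovery/backfill the right direction here, e.g. something
along these lines (values only illustrative)?

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'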

Thanks. 

Regards, 
Ossi 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to calculate the nearfull ratio ?

2017-05-04 Thread Loic Dachary
Hi,

In a cluster where the failure domain is the host and dozens of hosts, the 85% 
default for nearfull ratio is fine. A host failing won't suddenly make the 
cluster 99% full. In smaller clusters, with 10 hosts or less, it is likely to 
not be enough. And in larger clusters 85% may be too much to reserve and 90% 
could be more than enough.

Is there a way to calculate the optimum nearfull ratio for a given crushmap ?

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Jason Dillaman
Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
command and post the resulting log to a new ticket at [1]? I'd also be
interested if you could re-create that
"librbd::object_map::InvalidateRequest" issue repeatably.

[1] http://tracker.ceph.com/projects/rbd/issues

On Thu, May 4, 2017 at 3:45 AM, Stefan Priebe - Profihost AG
 wrote:
> Example:
> # rbd rm cephstor2/vm-136-disk-1
> Removing image: 99% complete...
>
> Stuck at 99% and never completes. This is an image which got corrupted
> for an unknown reason.
>
> Greets,
> Stefan
>
> Am 04.05.2017 um 08:32 schrieb Stefan Priebe - Profihost AG:
>> I'm not sure whether this is related but our backup system uses rbd
>> snapshots and reports sometimes messages like these:
>> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
>> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
>>
>> Stefan
>>
>>
>> Am 04.05.2017 um 07:49 schrieb Stefan Priebe - Profihost AG:
>>> Hello,
>>>
>>> since we've upgraded from hammer to jewel 10.2.7 and enabled
>>> exclusive-lock,object-map,fast-diff we've problems with corrupting VM
>>> filesystems.
>>>
>>> Sometimes the VMs are just crashing with FS errors and a restart can
>>> solve the problem. Sometimes the whole VM is not even bootable and we
>>> need to import a backup.
>>>
>>> All of them have the same problem that you can't revert to an older
>>> snapshot. The rbd command just hangs at 99% forever.
>>>
>>> Is this a known issue - anything we can check?
>>>
>>> Greets,
>>> Stefan
>>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance

2017-05-04 Thread Fuxion Cloud
Hi,

Our ceph version is 0.80.7. We use it with OpenStack as block storage (RBD).
The ceph storage is configured with 3-way replication of the data. I'm
getting low IOPS (400) from an fio benchmark in random read/write. Please
advise how to improve it. Thanks.

Here's the hardware info.
12 x storage nodes
- 2 x cpus (12 cores)
- 64 GB RAM
- 10 x 4TB SAS 7.2krpm OSD
- 2 x 200GB SSD Journal
- 2 x 200GB SSD OS
- 2 x 10Gb (bond - ceph network)
- 2 x 10Gb (bond - openstack network)

Ceph status:

 health HEALTH_OK
 monmap e1: 3 mons at {node1=
10.10.10.11:6789/0,node2=10.10.10.12:6789/0,node7=10.10.10.17:6789/0},
election epoch 1030, quorum 0,1,2 node1,node2,node7
 osdmap e116285: 120 osds: 120 up, 120 in
  pgmap v70119491: 14384 pgs, 5 pools, 5384 GB data, 841 kobjects
16774 GB used, 397 TB / 413 TB avail
   14384 active+clean
  client io 11456 kB/s rd, 13389 kB/s wr, 420 op/s

Ceph osd tree:
# id weight type name up/down reweight
-1 414 root default
-14 207 rack rack1
-3 34.5 host node1
1 3.45 osd.1 up 1
4 3.45 osd.4 up 1
7 3.45 osd.7 up 1
10 3.45 osd.10 up 1
13 3.45 osd.13 up 1
16 3.45 osd.16 up 1
19 3.45 osd.19 up 1
22 3.45 osd.22 up 1
25 3.45 osd.25 up 1
28 3.45 osd.28 up 1
-4 34.5 host node2
5 3.45 osd.5 up 1
11 3.45 osd.11 up 1
14 3.45 osd.14 up 1
17 3.45 osd.17 up 1
20 3.45 osd.20 up 1
23 3.45 osd.23 up 1
26 3.45 osd.26 up 1
29 3.45 osd.29 up 1
38 3.45 osd.38 up 1
2 3.45 osd.2 up 1
-5 34.5 host node3
31 3.45 osd.31 up 1
48 3.45 osd.48 up 1
57 3.45 osd.57 up 1
66 3.45 osd.66 up 1
75 3.45 osd.75 up 1
84 3.45 osd.84 up 1
93 3.45 osd.93 up 1
102 3.45 osd.102 up 1
111 3.45 osd.111 up 1
39 3.45 osd.39 up 1
-7 34.5 host node4
35 3.45 osd.35 up 1
46 3.45 osd.46 up 1
55 3.45 osd.55 up 1
64 3.45 osd.64 up 1
72 3.45 osd.72 up 1
81 3.45 osd.81 up 1
90 3.45 osd.90 up 1
98 3.45 osd.98 up 1
107 3.45 osd.107 up 1
116 3.45 osd.116 up 1
-10 34.5 host node5
43 3.45 osd.43 up 1
54 3.45 osd.54 up 1
60 3.45 osd.60 up 1
67 3.45 osd.67 up 1
78 3.45 osd.78 up 1
87 3.45 osd.87 up 1
96 3.45 osd.96 up 1
104 3.45 osd.104 up 1
113 3.45 osd.113 up 1
8 3.45 osd.8 up 1
-13 34.5 host node6
32 3.45 osd.32 up 1
47 3.45 osd.47 up 1
56 3.45 osd.56 up 1
65 3.45 osd.65 up 1
74 3.45 osd.74 up 1
83 3.45 osd.83 up 1
92 3.45 osd.92 up 1
110 3.45 osd.110 up 1
119 3.45 osd.119 up 1
101 3.45 osd.101 up 1
-15 207 rack rack2
-2 34.5 host node7
0 3.45 osd.0 up 1
3 3.45 osd.3 up 1
6 3.45 osd.6 up 1
9 3.45 osd.9 up 1
12 3.45 osd.12 up 1
15 3.45 osd.15 up 1
18 3.45 osd.18 up 1
21 3.45 osd.21 up 1
24 3.45 osd.24 up 1
27 3.45 osd.27 up 1
-6 34.5 host node8
30 3.45 osd.30 up 1
40 3.45 osd.40 up 1
49 3.45 osd.49 up 1
58 3.45 osd.58 up 1
68 3.45 osd.68 up 1
77 3.45 osd.77 up 1
86 3.45 osd.86 up 1
95 3.45 osd.95 up 1
105 3.45 osd.105 up 1
114 3.45 osd.114 up 1
-8 34.5 host node9
33 3.45 osd.33 up 1
45 3.45 osd.45 up 1
52 3.45 osd.52 up 1
59 3.45 osd.59 up 1
73 3.45 osd.73 up 1
82 3.45 osd.82 up 1
91 3.45 osd.91 up 1
100 3.45 osd.100 up 1
108 3.45 osd.108 up 1
117 3.45 osd.117 up 1
-9 34.5 host node10
36 3.45 osd.36 up 1
42 3.45 osd.42 up 1
51 3.45 osd.51 up 1
61 3.45 osd.61 up 1
69 3.45 osd.69 up 1
76 3.45 osd.76 up 1
85 3.45 osd.85 up 1
94 3.45 osd.94 up 1
103 3.45 osd.103 up 1
112 3.45 osd.112 up 1
-11 34.5 host node11
50 3.45 osd.50 up 1
63 3.45 osd.63 up 1
71 3.45 osd.71 up 1
79 3.45 osd.79 up 1
89 3.45 osd.89 up 1
106 3.45 osd.106 up 1
115 3.45 osd.115 up 1
34 3.45 osd.34 up 1
120 3.45 osd.120 up 1
121 3.45 osd.121 up 1
-12 34.5 host node12
37 3.45 osd.37 up 1
44 3.45 osd.44 up 1
53 3.45 osd.53 up 1
62 3.45 osd.62 up 1
70 3.45 osd.70 up 1
80 3.45 osd.80 up 1
88 3.45 osd.88 up 1
99 3.45 osd.99 up 1
109 3.45 osd.109 up 1
118 3.45 osd.118 up 1


Thanks,
James

On Thu, May 4, 2017 at 5:06 PM, Christian Wuerdig <
christian.wuer...@gmail.com> wrote:

>
>
> On Thu, May 4, 2017 at 7:53 PM, Fuxion Cloud 
> wrote:
>
>> Hi all,
>>
>> Im newbie in ceph technology. We have ceph deployed by vendor 2 years ago
>> with Ubuntu 14.04LTS without fine tuned the performance. I noticed that the
>> performance of storage is very slow. Can someone please help to advise how
>> to  improve the performance?
>>
>>
> You really need to provide a bit more information than that. Like what
> hardware is involved (CPU, RAM, how many nodes, how many OSDs, what kind of
> disks, what size disks, networking hardware), how you use ceph (RBD, RGW,
> CephFS, plain RADOS object storage).
>
> Outputs of
>
> ceph status
> ceph osd tree
> ceph df
>
> also provide useful information.
>
> Also what does "slow performance" mean - how have you determined that
> (throughput, latency)?
>
>
>> Any changes or configuration require for OS kernel?
>>
>> Regards,
>> James
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Ceph health warn MDS failing to respond to cache pressure

2017-05-04 Thread Webert de Souza Lima
I have faced the same problem many times. Usually it doesn't cause anything
bad, but I had a 30 min system outage twice because of this.
It might be because of the number of inodes on your ceph filesystem. Go to
the MDS server and do (supposing your mds server id is intcfs-osd1):

 ceph daemon mds.intcfs-osd1 perf dump mds

look for the inode_max and inodes values:
inode_max is the maximum number of inodes to cache and inodes is the number
currently in the cache.

if it is full, mount the cephfs with the "-o dirstat" option, and cat the
mountpoint, for example:

 mount -t ceph  10.0.0.1:6789:/ /mnt -o
dirstat,name=admin,secretfile=/etc/ceph/admin.secret
 cat /mnt

look for the rentries number. If it is larger than inode_max, raise the
mds cache size option in ceph.conf to a number that fits and restart the
mds (beware: this will cause cephfs to stall for a while; do it at your
own risk).
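
For example, something like this in ceph.conf (the number is only an
illustration; size it to your rentries count):

    [mds]
        mds cache size = 500000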

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, May 4, 2017 at 3:28 AM, gjprabu  wrote:

> Hi Team,
>
>   We are running cephfs with 5 OSD and 3 Mon and 1 MDS. There is
> Heath Warn "*failing to respond to cache pressure*" . Kindly advise to
> fix this issue.
>
>
> cluster b466e09c-f7ae-4e89-99a7-99d30eba0a13
>  health HEALTH_WARN
> mds0: Client integ-hm8-1.csez.zohocorpin.com failing to
> respond to cache pressure
> mds0: Client integ-hm5 failing to respond to cache pressure
> mds0: Client integ-hm9 failing to respond to cache pressure
> mds0: Client integ-hm2 failing to respond to cache pressure
>  monmap e2: 3 mons at {intcfs-mon1=192.168.113.113:6
> 789/0,intcfs-mon2=192.168.113.114:6789/0,intcfs-mon3=192.168.113.72:6789/0
> }
> election epoch 16, quorum 0,1,2 intcfs-mon3,intcfs-mon1,intcfs
> -mon2
>   fsmap e79409: 1/1/1 up {0=intcfs-osd1=up:active}, 1 up:standby
>  osdmap e3343: 5 osds: 5 up, 5 in
> flags sortbitwise
>   pgmap v13065759: 564 pgs, 3 pools, 5691 GB data, 12134 kobjects
> 11567 GB used, 5145 GB / 16713 GB avail
>  562 active+clean
>2 active+clean+scrubbing+deep
> 2 active+clean+scrubbing+deep
>
>
> Regards
> Prabu GJ
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread Marco Gaiarin
Mandi! Marc Roos
  In chel di` si favelave...

> Just a thought, what about marking connections with iptables and using 
> that mark with tc? 

Surely, but many things have to be taken into account:

a) doing traffic control means disabling ALL network hardware
optimizations (queueing, checksum offloading, ...), and I don't know
the impact on ceph.

b) simple control (e.g. traffic clamping on an interface) adds only a
little overhead, but if complex setups are needed (multiqueue, traffic
shaping by IP/port/...) I think more overhead gets added.

c) again, simple control can be done on ingress easily, but if complex
setups are needed the ingress traffic must be redirected to another
interface (usually an IFB interface) and proper egress shaping done
there. Also, there is no netfilter on ifbX interfaces.


I'm using that stuff on firewalls, where performance on modern/decent
hardware is not a problem at all.

So, no, I have no benchmarks at all. ;)

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread Marc Roos
 

Just a thought, what about marking connections with iptables and using 
that mark with tc? 
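
A rough sketch of what I mean (the interface name, the default civetweb port
7480 and the rates are assumptions, and note this shapes the marked class as
a whole rather than each individual connection):

    # mark outbound RGW traffic
    iptables -t mangle -A OUTPUT -p tcp --sport 7480 -j MARK --set-mark 10
    # put marked traffic into a 5 Mbit/s HTB class
    tc qdisc add dev eth0 root handle 1: htb
    tc class add dev eth0 parent 1: classid 1:10 htb rate 5mbit
    tc filter add dev eth0 parent 1: protocol ip handle 10 fw flowid 1:10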






-Original Message-
From: hrchu [mailto:petertc@gmail.com] 
Sent: donderdag 4 mei 2017 10:35
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] Limit bandwidth on RadosGW?

Thanks for reply.

tc can only do limit on interfaces or given IPs, but what I am talking 
about is "per connection", e.g.,  each put object could be 5MB/s, get 
object could be 1MB/s.

Correct me if anything wrong.


Regards,

Chu, Hua-Rong (曲華榮), +886-3-4227151 #57968 Networklab, Computer 
Science & Information Engineering, National Central University, Jhongli, 
Taiwan R.O.C.

On Thu, May 4, 2017 at 4:01 PM, Marc Roos  
wrote:





No experience with it. But why not use linux for it? Maybe this 
solution
on every RGW is sufficient, I cannot imagine you need 3rd party for
this.

https://unix.stackexchange.com/questions/28198/how-to-limit-network-bandwidth
https://wiki.archlinux.org/index.php/Advanced_traffic_control




-Original Message-
From: hrchu [mailto:petertc@gmail.com]
Sent: donderdag 4 mei 2017 9:24
To: Ceph Users
Subject: [ceph-users] Limit bandwidth on RadosGW?

Hi all,
I want to limit RadosGW per connection upload/download speed for 
QoS.
There is no build-in option for this, so maybe a 3rd party reverse 
proxy
in front of Radosgw is needed. Does anyone have experience about 
this?






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Performance

2017-05-04 Thread Christian Wuerdig
On Thu, May 4, 2017 at 7:53 PM, Fuxion Cloud  wrote:

> Hi all,
>
> Im newbie in ceph technology. We have ceph deployed by vendor 2 years ago
> with Ubuntu 14.04LTS without fine tuned the performance. I noticed that the
> performance of storage is very slow. Can someone please help to advise how
> to  improve the performance?
>
>
You really need to provide a bit more information than that. Like what
hardware is involved (CPU, RAM, how many nodes, how many OSDs, what kind of
disks, what size disks, networking hardware), how you use ceph (RBD, RGW,
CephFS, plain RADOS object storage).

Outputs of

ceph status
ceph osd tree
ceph df

also provide useful information.

Also what does "slow performance" mean - how have you determined that
(throughput, latency)?


> Any changes or configuration require for OS kernel?
>
> Regards,
> James
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread hrchu
Thanks for reply.

tc can only do limit on interfaces or given IPs, but what I am talking
about is "per connection", e.g.,  each put object could be 5MB/s, get
object could be 1MB/s.

Correct me if anything wrong.


Regards,

Chu, Hua-Rong (曲華榮), +886-3-4227151 #57968
Networklab, Computer Science & Information Engineering,
National Central University, Jhongli, Taiwan R.O.C.

On Thu, May 4, 2017 at 4:01 PM, Marc Roos  wrote:

>
>
>
> No experience with it. But why not use linux for it? Maybe this solution
> on every RGW is sufficient, I cannot imagine you need 3rd party for
> this.
>
> https://unix.stackexchange.com/questions/28198/how-to-
> limit-network-bandwidth
> https://wiki.archlinux.org/index.php/Advanced_traffic_control
>
>
>
> -Original Message-
> From: hrchu [mailto:petertc@gmail.com]
> Sent: donderdag 4 mei 2017 9:24
> To: Ceph Users
> Subject: [ceph-users] Limit bandwidth on RadosGW?
>
> Hi all,
> I want to limit RadosGW per connection upload/download speed for QoS.
> There is no build-in option for this, so maybe a 3rd party reverse proxy
> in front of Radosgw is needed. Does anyone have experience about this?
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph newbie thoughts and questions

2017-05-04 Thread Marcus

Thank you very much for your answer David, just what I was after!

Just some additional questions to make it clear to me.
The MDS do not need to be in odd numbers?
They can be set up as 1, 2, 3, 4 and so on, as needed?

You made the basics clear to me so when I set up my first ceph fs I need 
as a start:

3 mons, 2 MDS and 3 OSDs (to be able to avoid a single point of failure).

Is there a clear ratio/relation/approximation between OSDs and MDS?
If I have, say, 100TB of disk for OSDs, do I need X GB of disk for the MDS?

About gluster, my machines are set up in a gluster cluster today, but the
reason for considering ceph fs for these machines instead is that I have
replication problems that I have not been able to solve. Second, we get
indications from our organisation that data use will expand very quickly,
and that is where I see that ceph fs will suit us: easy to expand as needed.
Thanks to your description of gluster I will be able to reconfigure my
gluster cluster and rsync to the mounted cluster. I had used rsync directly
to the hard drive, and now it is obvious that this does not work (it worked
fine on a single distributed server, but not as a replica). I just hadn't
got this tip from anybody else. Thanks again!


We will start using ceph fs, because this goes hand in hand with our 
future needs.


Best regards
Marcus



On 04/05/17 06:30, David Turner wrote:
The clients will need to be able to contact the mons and the osds.  
NEVER use 2 mons.  Mons are a quorum and work best with odd numbers 
(1, 3, 5, etc).  1 mon is better than 2 mons.  It is better to remove 
the raid and put the individual disks as OSDs.  Ceph handles the 
redundancy through replica copies.  It is much better to have a third 
node for failure domain reasons so you can have 3 copies of your data 
and have 1 in each of the 3 servers.  The OSDs store their information 
in broken up objects divvied up into PGs that are assigned to the 
OSDs.  You would need to set up CephFS and rsync the data into it to 
migrate the data into ceph.


I don't usually recommend this, but you might prefer Gluster.  You 
would use the raided disks as the brick in each node.  Set it up to 
have 2 copies (better to have 3 but you only have 2 nodes).  Each 
server can be used to NFS map the gluster mount point.  The files are 
stored as flat files on the bricks, but you would still need to create 
the gluster first and then rsync the data into the mounted gluster 
instead of directly onto the disk.  With this you don't have to worry 
about the mon service, mds service, osd services, balancing the crush 
map, etc.  Gluster of course has its own complexities and limitations, 
but it might be closer to what you're looking for right now.


On Wed, May 3, 2017 at 4:06 PM Marcus Pedersén > wrote:


Hello everybody!


I am a newbie on ceph and I really like it and want to try it out.
I have a couple of thoughts and questions after reading
documentation and need some help to see that I am on the right path.

Today I have two file servers in production that I want to start
my ceph fs on and expand from that.
I want these servers to function as a failover cluster and as I
see it I will be able to do it with ceph.

To get a failover cluster without a single point of failure I need
at least 2 monitors, 2 mds and 2 osd (my existing file servers),
right?
Today, both of the file servers use a raid on 8 disks. Do I format
my raid xfs and run my osds on the raid?
Or do I split up my raid and add the disks directly to the osds?

When I connect clients to my ceph fs are they talking to the mds
or are the clients talking to the ods directly as well?
If the client just talk to the mds then the ods and the monitor
can be in a separate network and the mds connected both to the
client network and the local "ceph" network.

Today, we have about 11TB data on these file servers, how do I
move the data to the ceph fs? Is it possible to rsync to one of
the ods disks, start the ods daemon and let it replicate itself?

Is it possible to set up the ceph fs with 2 mds, 2 monitors and 1
ods and add the second ods later?
This is to be able to have one file server in production, config
ceph and test with the other, swap to the ceph system and when it
is up and running add the second ods.

Of course I will test this out before I bring it to production.

Many thanks in advance!

Best regards
Marcus


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

*Marcus Pedersén*
/System administrator/


*Interbull Centre*
Department of Animal Breeding & Genetics — SLU
Box 7023, SE-750 07

Re: [ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread Marc Roos
 


No experience with it. But why not use linux for it? Maybe this solution 
on every RGW is sufficient, I cannot imagine you need 3rd party for 
this.

https://unix.stackexchange.com/questions/28198/how-to-limit-network-bandwidth
https://wiki.archlinux.org/index.php/Advanced_traffic_control



-Original Message-
From: hrchu [mailto:petertc@gmail.com] 
Sent: donderdag 4 mei 2017 9:24
To: Ceph Users
Subject: [ceph-users] Limit bandwidth on RadosGW?

Hi all,
I want to limit RadosGW per connection upload/download speed for QoS.
There is no build-in option for this, so maybe a 3rd party reverse proxy 
in front of Radosgw is needed. Does anyone have experience about this?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Performance

2017-05-04 Thread Fuxion Cloud
Hi all,

I'm a newbie to ceph technology. We have ceph, deployed by a vendor 2 years
ago on Ubuntu 14.04 LTS, without the performance ever being tuned. I noticed
that the performance of the storage is very slow. Can someone please advise
how to improve the performance?

Any changes or configuration require for OS kernel?

Regards,
James
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
There are no watchers involved:
# rbd status cephstor2/vm-136-disk-1
Watchers: none
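
The same can be double-checked at the RADOS level; the header object name
depends on the image id from 'rbd info', so this is only a sketch:

    rados -p cephstor2 listwatchers rbd_header.<image-id>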

Greets,
Stefan

Am 04.05.2017 um 09:45 schrieb Stefan Priebe - Profihost AG:
> Example:
> # rbd rm cephstor2/vm-136-disk-1
> Removing image: 99% complete...
> 
> Stuck at 99% and never completes. This is an image which got corrupted
> for an unknown reason.
> 
> Greets,
> Stefan
> 
> Am 04.05.2017 um 08:32 schrieb Stefan Priebe - Profihost AG:
>> I'm not sure whether this is related but our backup system uses rbd
>> snapshots and reports sometimes messages like these:
>> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
>> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
>>
>> Stefan
>>
>>
>> Am 04.05.2017 um 07:49 schrieb Stefan Priebe - Profihost AG:
>>> Hello,
>>>
>>> since we've upgraded from hammer to jewel 10.2.7 and enabled
>>> exclusive-lock,object-map,fast-diff we've problems with corrupting VM
>>> filesystems.
>>>
>>> Sometimes the VMs are just crashing with FS errors and a restart can
>>> solve the problem. Sometimes the whole VM is not even bootable and we
>>> need to import a backup.
>>>
>>> All of them have the same problem that you can't revert to an older
>>> snapshot. The rbd command just hangs at 99% forever.
>>>
>>> Is this a known issue - anything we can check?
>>>
>>> Greets,
>>> Stefan
>>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
Example:
# rbd rm cephstor2/vm-136-disk-1
Removing image: 99% complete...

Stuck at 99% and never completes. This is an image which got corrupted
for an unknown reason.

Greets,
Stefan

Am 04.05.2017 um 08:32 schrieb Stefan Priebe - Profihost AG:
> I'm not sure whether this is related but our backup system uses rbd
> snapshots and reports sometimes messages like these:
> 2017-05-04 02:42:47.661263 7f3316ffd700 -1
> librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
> 
> Stefan
> 
> 
> Am 04.05.2017 um 07:49 schrieb Stefan Priebe - Profihost AG:
>> Hello,
>>
>> since we've upgraded from hammer to jewel 10.2.7 and enabled
>> exclusive-lock,object-map,fast-diff we've problems with corrupting VM
>> filesystems.
>>
>> Sometimes the VMs are just crashing with FS errors and a restart can
>> solve the problem. Sometimes the whole VM is not even bootable and we
>> need to import a backup.
>>
>> All of them have the same problem that you can't revert to an older
>> snapshot. The rbd command just hangs at 99% forever.
>>
>> Is this a known issue - anything we can check?
>>
>> Greets,
>> Stefan
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Limit bandwidth on RadosGW?

2017-05-04 Thread hrchu
Hi all,
I want to limit RadosGW per connection upload/download speed for QoS.
There is no built-in option for this, so maybe a 3rd party reverse proxy in
front of Radosgw is needed. Does anyone have experience with this?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
I'm not sure whether this is related, but our backup system uses rbd
snapshots and sometimes reports messages like these:
2017-05-04 02:42:47.661263 7f3316ffd700 -1
librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0

Stefan


Am 04.05.2017 um 07:49 schrieb Stefan Priebe - Profihost AG:
> Hello,
> 
> since we've upgraded from hammer to jewel 10.2.7 and enabled
> exclusive-lock,object-map,fast-diff we've problems with corrupting VM
> filesystems.
> 
> Sometimes the VMs are just crashing with FS errors and a restart can
> solve the problem. Sometimes the whole VM is not even bootable and we
> need to import a backup.
> 
> All of them have the same problem that you can't revert to an older
> snapshot. The rbd command just hangs at 99% forever.
> 
> Is this a known issue - anything we can check?
> 
> Greets,
> Stefan
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph health warn MDS failing to respond to cache pressure

2017-05-04 Thread gjprabu
Hi Team,



  We are running cephfs with 5 OSDs, 3 mons and 1 MDS. There is a Health
Warn "failing to respond to cache pressure". Kindly advise how to fix this issue.




cluster b466e09c-f7ae-4e89-99a7-99d30eba0a13

 health HEALTH_WARN

mds0: Client integ-hm8-1.csez.zohocorpin.com failing to respond to 
cache pressure

mds0: Client integ-hm5 failing to respond to cache pressure

mds0: Client integ-hm9 failing to respond to cache pressure

mds0: Client integ-hm2 failing to respond to cache pressure

 monmap e2: 3 mons at 
{intcfs-mon1=192.168.113.113:6789/0,intcfs-mon2=192.168.113.114:6789/0,intcfs-mon3=192.168.113.72:6789/0}

election epoch 16, quorum 0,1,2 intcfs-mon3,intcfs-mon1,intcfs-mon2

  fsmap e79409: 1/1/1 up {0=intcfs-osd1=up:active}, 1 up:standby

 osdmap e3343: 5 osds: 5 up, 5 in

flags sortbitwise

  pgmap v13065759: 564 pgs, 3 pools, 5691 GB data, 12134 kobjects

11567 GB used, 5145 GB / 16713 GB avail

 562 active+clean

   2 active+clean+scrubbing+deep

  client io 8090 kB/s rd, 29032 kB/s wr, 25 op/s rd, 129 op/s wr





Regards

Prabu GJ

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com