Re: [ceph-users] Ceph and multiple RDMA NICs

2018-03-01 Thread Justinas LINGYS
Thank you for the reply and explanation. I will take a look at your reference 
related to the ML and Ceph.


From: David Turner 
Sent: Friday, March 2, 2018 2:12:18 PM
To: Justinas LINGYS
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph and multiple RDMA NICs

The only communication on the private network for Ceph is between the OSDs for 
replication, erasure coding, backfilling, and recovery. Everything else is on 
the public network, including communication with clients, mons, MDS, RGW and, 
literally, everything else.

I haven't used RDMA, but as far as the question of Ceph public network vs private 
network goes, that is what they do. You can decide whether you want to have 2 
different subnets for them. There have been some threads on the ML about RDMA and 
getting it working.

On Fri, Mar 2, 2018, 12:53 AM Justinas LINGYS <jlin...@connect.ust.hk> wrote:
Hi David,

Thank you for your reply. As I understand it, your experience with multiple 
subnets suggests sticking to a single device. However, I have a powerful RDMA NIC 
(100Gbps) with two ports, and I have seen recommendations from Mellanox to 
separate the two networks. Also, I am planning on having quite a lot of traffic 
on my private network, since it's for a research project which uses machine 
learning and stores a lot of data in a Ceph cluster. Considering my case, I 
assume it is worth the pain of separating the two networks to get the best out 
of the advanced NIC.

Justin


From: David Turner <drakonst...@gmail.com>
Sent: Thursday, March 1, 2018 9:57:50 PM
To: Justinas LINGYS
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph and multiple RDMA NICs

There has been some chatter on the ML questioning the need to separate out the 
public and private subnets for Ceph. The trend seems to be toward simplifying 
your configuration, which for some means not specifying multiple subnets at all.  
I haven't heard of anyone complaining about network problems after putting 
private and public on the same subnets, but I have seen a lot of people with 
networking problems from splitting them up.

Personally I use VLANs for the two on the same interface at home, and I have 
4-port 10Gb NICs at the office, so we split that up as well; but even there we 
might be better suited by bonding all 4 together and using a VLAN to split 
traffic.  I wouldn't merge them together now, since we have graphing on our 
storage nodes for the public and private networks.

But the take-away is that if it's too hard to split your public and private 
subnets... don't.  I doubt you would notice any difference if you were to get 
it working vs just not doing it.

On Thu, Mar 1, 2018 at 3:24 AM Justinas LINGYS <jlin...@connect.ust.hk> wrote:
Hi all,

I am running a small Ceph cluster (1 MON and 3 OSDs), and it works fine.
However, I have a question about the two networks (public and cluster) that an 
OSD uses.
There is a reference from Mellanox 
(https://community.mellanox.com/docs/DOC-2721) on how to configure 'ceph.conf'. 
However, after reading the source code (luminous-stable), I get the feeling that 
we cannot run Ceph with two NICs/ports, as we only have one 
'ms_async_rdma_local_gid' per OSD, and it seems that the source code only uses 
one option (NIC). I would like to ask how I could communicate with the public 
network via one RDMA NIC and with the cluster network via another RDMA NIC 
(applying RoCEv2 to both NICs). Since GIDs are unique within a machine, how can 
I use two different GIDs in 'ceph.conf'?
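
To make the question concrete, this is roughly what I have in mind for 
'ceph.conf', loosely following the Mellanox document above (the device name, 
subnets and GID value are placeholders from my setup, not a tested 
configuration):

   [global]
   ms_type = async+rdma
   ms_async_rdma_device_name = mlx5_0
   # public (client) traffic and cluster (replication) traffic
   public network  = 10.10.1.0/24
   cluster network = 10.10.2.0/24
   # only a single GID can be given here (taken from 'show_gids' on the node);
   # I see no per-network variant of this option in the source
   ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:0a0a:0105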

Justin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Joshua Chen
Dear all,
  I wonder how we could support VM systems with Ceph storage (block
device)? My colleagues are waiting for my answer for VMware (vSphere 5), and
I myself use oVirt (RHEV); the default protocol is iSCSI.
  I know that OpenStack/Cinder works well with Ceph, and Proxmox (just heard)
too. But currently we are using VMware and oVirt.


Your wise suggestion is appreciated

Cheers
Joshua


On Thu, Mar 1, 2018 at 3:16 AM, Mark Schouten  wrote:

> Does Xen still not support RBD? Ceph has been around for years now!
>
> Kind regards,
>
> --
> Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
> Mark Schouten | Tuxis Internet Engineering
> KvK: 61527076 | http://www.tuxis.nl/
> T: 0318 200208 | i...@tuxis.nl
>
>
>
> * From: * Massimiliano Cuttini 
> * To: * "ceph-users@lists.ceph.com" 
> * Sent: * 28-2-2018 13:53
> * Subject: * [ceph-users] Ceph iSCSI is a prank?
>
> I was building Ceph in order to use it with iSCSI.
> But I just saw from the docs that it needs:
>
> *CentOS 7.5*
> (which is not available yet, it's still at 7.4)
> https://wiki.centos.org/Download
>
> *Kernel 4.17*
> (which is not available yet, it is still at 4.15.7)
> https://www.kernel.org/
>
> So I guess there is no official support and this is just a bad prank.
>
> Ceph has been ready to be used with S3 for many years.
> But it needs the kernel of the next century to work with such an old
> technology as iSCSI.
> So sad.
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and multiple RDMA NICs

2018-03-01 Thread David Turner
The only communication on the private network for Ceph is between the OSDs
for replication, erasure coding, backfilling, and recovery. Everything else
is on the public network, including communication with clients, mons, MDS,
RGW and, literally, everything else.

I haven't used RDMA, but as far as the question of Ceph public network vs
private network goes, that is what they do. You can decide whether you want to
have 2 different subnets for them. There have been some threads on the ML about
RDMA and getting it working.

On Fri, Mar 2, 2018, 12:53 AM Justinas LINGYS 
wrote:

> Hi David,
>
> Thank you for your reply. As I understand your experience with multiple
> subnets
> suggests sticking to a single device. However, I have a powerful RDMA NIC
> (100Gbps) with two ports and I have seen recommendations from Mellanox to
> separate the
> two networks. Also, I am planning on having quite a lot of traffic on my
> private network since it's for a research project which uses machine
> learning and it stores a lot of data in a Ceph cluster. Considering my
> case, I assume it is worth the pain separating the two networks to get best
> out the advanced NIC.
>
> Justin
>
> 
> From: David Turner 
> Sent: Thursday, March 1, 2018 9:57:50 PM
> To: Justinas LINGYS
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph and multiple RDMA NICs
>
> There has been some chatter on the ML questioning the need to separate out
> the public and private subnets for Ceph. The trend seems to be in
> simplifying your configuration which for some is not specifying multiple
> subnets here.  I haven't heard of anyone complaining about network problems
> with putting private and public on the same subnets, but I have seen a lot
> of people with networking problems by splitting them up.
>
> Personally I use vlans for the 2 on the same interface at home and I have
> 4 port 10Gb nics at the office, so we split that up as well, but even there
> we might be better suited with bonding all 4 together and using a vlan to
> split traffic.  I wouldn't merge them together since we have graphing on
> our storage nodes for public and private networks.
>
> But the take-away is that if it's too hard to split your public and
> private subnets... don't.  I doubt you would notice any difference if you
> were to get it working vs just not doing it.
>
> On Thu, Mar 1, 2018 at 3:24 AM Justinas LINGYS wrote:
> Hi all,
>
> I am running a small Ceph cluster  (1 MON and 3OSDs), and it works fine.
> However, I have a doubt about the two networks (public and cluster) that
> an OSD uses.
> There is a reference from Mellanox (
> https://community.mellanox.com/docs/DOC-2721) how to configure
> 'ceph.conf'. However, after reading the source code (luminous-stable), I
> get a feeling that we cannot run Ceph with two NICs/Ports as we only have
> one 'ms_async_rdma_local_gid' per OSD, and it seems that the source code
> only uses one option (NIC). I would like to ask how I could communicate
> with the public network via one RDMA NIC and communicate  with the cluster
> network via another RDMA NIC (apply RoCEV2 to both NICs). Since gids are
> unique within a machine, how can I use two different gids in 'ceph.conf'?
>
> Justin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread Alex Gorbachev
On Thu, Mar 1, 2018 at 10:57 PM, David Turner  wrote:
> Blocked requests and slow requests are synonyms in ceph. They are 2 names
> for the exact same thing.
>
>
> On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev  wrote:
>>
>> On Thu, Mar 1, 2018 at 2:47 PM, David Turner 
>> wrote:
>> > `ceph health detail` should show you more information about the slow
>> > requests.  If the output is too much stuff, you can grep out for blocked
>> > or
>> > something.  It should tell you which OSDs are involved, how long they've
>> > been slow, etc.  The default is for them to show '> 32 sec' but that may
>> > very well be much longer and `ceph health detail` will show that.
>>
>> Hi David,
>>
>> Thank you for the reply.  Unfortunately, the health detail only shows
>> blocked requests.  This seems to be related to a compression setting
>> on the pool, nothing in OSD logs.
>>
>> I replied to another compression thread.  This makes sense since
>> compression is new, and in the past all such issues were reflected in
>> OSD logs and related to either network or OSD hardware.
>>
>> Regards,
>> Alex
>>
>> >
>> > On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev 
>> > wrote:
>> >>
>> >> Is there a switch to turn on the display of specific OSD issues?  Or
>> >> does the below indicate a generic problem, e.g. network and no any
>> >> specific OSD?
>> >>
>> >> 2018-02-28 18:09:36.438300 7f6dead56700  0
>> >> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
>> >> total 15997 MB, used 6154 MB, avail 9008 MB
>> >> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
>> >> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
>> >> (REQUEST_SLOW)
>> >> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
>> >> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
>> >> (REQUEST_SLOW)
>> >> 2018-02-28 18:09:53.794882 7f6de8551700  0
>> >> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
>> >> "status", "format": "json"} v 0) v1
>> >>
>> >> --

I was wrong about the pool compression mattering; even an
uncompressed pool also generates these slow messages.

The question is why there is no subsequent message relating to specific OSDs
(like in Jewel and prior), for example this from RH:

2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
6 slow requests, 6 included below; oldest blocked for > 61.758455 secs

2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
old, received at {date-time}: osd_op(client.4240.0:8
benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
currently waiting for subops from [610]

In comparison, my Luminous cluster only shows the general slow/blocked message:

2018-03-01 21:52:54.237270 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check failed: 116 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:00.282721 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check update: 66 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:08.534244 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check update: 5 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:10.382510 7f7e419e3700  0 log_channel(cluster) log
[INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
blocked > 32 sec)
2018-03-01 21:53:10.382546 7f7e419e3700  0 log_channel(cluster) log
[INF] : Cluster is now healthy

So where are the details?
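
(The closest I have found so far is asking an OSD directly over its admin
socket, e.g.:

   # on the node hosting the OSD; osd.12 is just an example id
   ceph daemon osd.12 dump_ops_in_flight
   ceph daemon osd.12 dump_historic_ops

but that presumes you already know which OSD to look at, which is exactly the
detail missing from the cluster log.)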

Thanks,
Alex
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and multiple RDMA NICs

2018-03-01 Thread Justinas LINGYS
Hi David,

Thank you for your reply. As I understand it, your experience with multiple 
subnets suggests sticking to a single device. However, I have a powerful RDMA NIC 
(100Gbps) with two ports, and I have seen recommendations from Mellanox to 
separate the two networks. Also, I am planning on having quite a lot of traffic 
on my private network, since it's for a research project which uses machine 
learning and stores a lot of data in a Ceph cluster. Considering my case, I 
assume it is worth the pain of separating the two networks to get the best out 
of the advanced NIC.

Justin


From: David Turner 
Sent: Thursday, March 1, 2018 9:57:50 PM
To: Justinas LINGYS
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph and multiple RDMA NICs

There has been some chatter on the ML questioning the need to separate out the 
public and private subnets for Ceph. The trend seems to be toward simplifying 
your configuration, which for some means not specifying multiple subnets at all.  
I haven't heard of anyone complaining about network problems after putting 
private and public on the same subnets, but I have seen a lot of people with 
networking problems from splitting them up.

Personally I use VLANs for the two on the same interface at home, and I have 
4-port 10Gb NICs at the office, so we split that up as well; but even there we 
might be better suited by bonding all 4 together and using a VLAN to split 
traffic.  I wouldn't merge them together now, since we have graphing on our 
storage nodes for the public and private networks.

But the take-away is that if it's too hard to split your public and private 
subnets... don't.  I doubt you would notice any difference if you were to get 
it working vs just not doing it.

On Thu, Mar 1, 2018 at 3:24 AM Justinas LINGYS <jlin...@connect.ust.hk> wrote:
Hi all,

I am running a small Ceph cluster (1 MON and 3 OSDs), and it works fine.
However, I have a question about the two networks (public and cluster) that an 
OSD uses.
There is a reference from Mellanox 
(https://community.mellanox.com/docs/DOC-2721) on how to configure 'ceph.conf'. 
However, after reading the source code (luminous-stable), I get the feeling that 
we cannot run Ceph with two NICs/ports, as we only have one 
'ms_async_rdma_local_gid' per OSD, and it seems that the source code only uses 
one option (NIC). I would like to ask how I could communicate with the public 
network via one RDMA NIC and with the cluster network via another RDMA NIC 
(applying RoCEv2 to both NICs). Since GIDs are unique within a machine, how can 
I use two different GIDs in 'ceph.conf'?

Justin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread David Turner
Blocked requests and slow requests are synonyms in ceph. They are 2 names
for the exact same thing.

On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev  wrote:

> On Thu, Mar 1, 2018 at 2:47 PM, David Turner 
> wrote:
> > `ceph health detail` should show you more information about the slow
> > requests.  If the output is too much stuff, you can grep out for blocked
> or
> > something.  It should tell you which OSDs are involved, how long they've
> > been slow, etc.  The default is for them to show '> 32 sec' but that may
> > very well be much longer and `ceph health detail` will show that.
>
> Hi David,
>
> Thank you for the reply.  Unfortunately, the health detail only shows
> blocked requests.  This seems to be related to a compression setting
> on the pool, nothing in OSD logs.
>
> I replied to another compression thread.  This makes sense since
> compression is new, and in the past all such issues were reflected in
> OSD logs and related to either network or OSD hardware.
>
> Regards,
> Alex
>
> >
> > On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev 
> > wrote:
> >>
> >> Is there a switch to turn on the display of specific OSD issues?  Or
> >> does the below indicate a generic problem, e.g. network and no any
> >> specific OSD?
> >>
> >> 2018-02-28 18:09:36.438300 7f6dead56700  0
> >> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
> >> total 15997 MB, used 6154 MB, avail 9008 MB
> >> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
> >> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
> >> (REQUEST_SLOW)
> >> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
> >> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
> >> (REQUEST_SLOW)
> >> 2018-02-28 18:09:53.794882 7f6de8551700  0
> >> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
> >> "status", "format": "json"} v 0) v1
> >>
> >> --
> >> Alex Gorbachev
> >> Storcium
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread Alex Gorbachev
On Thu, Mar 1, 2018 at 2:47 PM, David Turner  wrote:
> `ceph health detail` should show you more information about the slow
> requests.  If the output is too much stuff, you can grep out for blocked or
> something.  It should tell you which OSDs are involved, how long they've
> been slow, etc.  The default is for them to show '> 32 sec' but that may
> very well be much longer and `ceph health detail` will show that.

Hi David,

Thank you for the reply.  Unfortunately, the health detail only shows
blocked requests.  This seems to be related to a compression setting
on the pool, nothing in OSD logs.

I replied to another compression thread.  This makes sense since
compression is new, and in the past all such issues were reflected in
OSD logs and related to either network or OSD hardware.

Regards,
Alex

>
> On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev 
> wrote:
>>
>> Is there a switch to turn on the display of specific OSD issues?  Or
>> does the below indicate a generic problem, e.g. network and no any
>> specific OSD?
>>
>> 2018-02-28 18:09:36.438300 7f6dead56700  0
>> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
>> total 15997 MB, used 6154 MB, avail 9008 MB
>> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
>> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
>> (REQUEST_SLOW)
>> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
>> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
>> (REQUEST_SLOW)
>> 2018-02-28 18:09:53.794882 7f6de8551700  0
>> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
>> "status", "format": "json"} v 0) v1
>>
>> --
>> Alex Gorbachev
>> Storcium
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-01 Thread Alex Gorbachev
On Thu, Mar 1, 2018 at 5:37 PM, Subhachandra Chandra
 wrote:
> Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives
> filled to around 90%. One thing that does increase memory usage is the
> number of clients simultaneously sending write requests to a particular
> primary OSD if the write sizes are large.

We have not seen a memory increase on Ubuntu 16.04, but I have also
repeatedly observed the following phenomenon:

When doing a vMotion in ESXi of a large 3TB file (this generates a lot
of IO requests of small size) to a Ceph pool with compression set to
force, after some time the Ceph cluster shows a large number of
blocked requests, and eventually timeouts become very large (to the
point where ESXi aborts the IO due to timeouts).  After the abort, the
blocked/slow request messages disappear.  There are no OSD errors.  I
have OSD logs if anyone is interested.

This does not occur when compression is unset.
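
For reference, this is roughly how I toggle compression between tests (the
pool name is just an example):

   ceph osd pool get esxi-pool compression_mode
   ceph osd pool set esxi-pool compression_mode force   # reproduces the blocked requests
   ceph osd pool set esxi-pool compression_mode none    # problem goes away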

--
Alex Gorbachev
Storcium

>
> Subhachandra
>
> On Thu, Mar 1, 2018 at 6:18 AM, David Turner  wrote:
>>
>> With default memory settings, the general rule is 1GB ram/1TB OSD.  If you
>> have a 4TB OSD, you should plan to have at least 4GB ram.  This was the
>> recommendation for filestore OSDs, but it was a bit much memory for the
>> OSDs.  From what I've seen, this rule is a little more appropriate with
>> bluestore now and should still be observed.
>>
>> Please note that memory usage in a HEALTH_OK cluster is not the same
>> amount of memory that the daemons will use during recovery.  I have seen
>> deployments with 4x memory usage during recovery.
>>
>> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman  wrote:
>>>
>>> Quoting Caspar Smit (caspars...@supernas.eu):
>>> > Stefan,
>>> >
>>> > How many OSD's and how much RAM are in each server?
>>>
>>> Currently 7 OSDs, 128 GB RAM. Max wil be 10 OSDs in these servers. 12
>>> cores (at least one core per OSD).
>>>
>>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM
>>> > right?
>>>
>>> Apparently. Sure they will use more RAM than just cache to function
>>> correctly. I figured 3 GB per OSD would be enough ...
>>>
>>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of
>>> > total
>>> > RAM. The cache is a part of the memory usage by bluestore OSD's.
>>>
>>> A factor 4 is quite high, isn't it? Where is all this RAM used for
>>> besides cache? RocksDB?
>>>
>>> So how should I size the amount of RAM in a OSD server for 10 bluestore
>>> SSDs in a
>>> replicated setup?
>>>
>>> Thanks,
>>>
>>> Stefan
>>>
>>> --
>>> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
>>> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-01 Thread Subhachandra Chandra
Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives
filled to around 90%. One thing that does increase memory usage is the
number of clients simultaneously sending write requests to a particular
primary OSD if the write sizes are large.
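
For context, the knobs we sized against are the per-OSD BlueStore cache
options; a sketch of the relevant ceph.conf section (the values shown are the
Luminous defaults as I understand them, not a recommendation):

   [osd]
   # per-OSD BlueStore cache; resident memory is this plus OSD overhead
   bluestore_cache_size_hdd = 1073741824   # 1 GB for HDD-backed OSDs
   bluestore_cache_size_ssd = 3221225472   # 3 GB for SSD-backed OSDs
   # bluestore_cache_size = 0 means "use the hdd/ssd-specific value above"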

Subhachandra

On Thu, Mar 1, 2018 at 6:18 AM, David Turner  wrote:

> With default memory settings, the general rule is 1GB ram/1TB OSD.  If you
> have a 4TB OSD, you should plan to have at least 4GB ram.  This was the
> recommendation for filestore OSDs, but it was a bit much memory for the
> OSDs.  From what I've seen, this rule is a little more appropriate with
> bluestore now and should still be observed.
>
> Please note that memory usage in a HEALTH_OK cluster is not the same
> amount of memory that the daemons will use during recovery.  I have seen
> deployments with 4x memory usage during recovery.
>
> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman  wrote:
>
>> Quoting Caspar Smit (caspars...@supernas.eu):
>> > Stefan,
>> >
>> > How many OSD's and how much RAM are in each server?
>>
>> Currently 7 OSDs, 128 GB RAM. Max wil be 10 OSDs in these servers. 12
>> cores (at least one core per OSD).
>>
>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM
>> right?
>>
>> Apparently. Sure they will use more RAM than just cache to function
>> correctly. I figured 3 GB per OSD would be enough ...
>>
>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of
>> total
>> > RAM. The cache is a part of the memory usage by bluestore OSD's.
>>
>> A factor 4 is quite high, isn't it? Where is all this RAM used for
>> besides cache? RocksDB?
>>
>> So how should I size the amount of RAM in a OSD server for 10 bluestore
>> SSDs in a
>> replicated setup?
>>
>> Thanks,
>>
>> Stefan
>>
>> --
>> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
>> | GPG: 0xD14839C6   +31 318 648 688
>> <+31%20318%20648%20688> / i...@bit.nl
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread David Turner
`ceph health detail` should show you more information about the slow
requests.  If the output is too much stuff, you can grep out for blocked or
something.  It should tell you which OSDs are involved, how long they've
been slow, etc.  The default is for them to show '> 32 sec' but that may
very well be much longer and `ceph health detail` will show that.
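
For example, something like this should narrow it down (the exact output
format may differ; the osd numbers here are made up):

   ceph health detail | grep -E 'REQUEST_SLOW|blocked'
   # REQUEST_SLOW 74 slow requests are blocked > 32 sec
   #     3 ops are blocked > 65.536 sec
   #     osds 3,12 have blocked requests > 65.536 sec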

On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev 
wrote:

> Is there a switch to turn on the display of specific OSD issues?  Or
> does the below indicate a generic problem, e.g. network and no any
> specific OSD?
>
> 2018-02-28 18:09:36.438300 7f6dead56700  0
> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
> total 15997 MB, used 6154 MB, avail 9008 MB
> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-02-28 18:09:53.794882 7f6de8551700  0
> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
> "status", "format": "json"} v 0) v1
>
> --
> Alex Gorbachev
> Storcium
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow requests troubleshooting in Luminous - details missing

2018-03-01 Thread Alex Gorbachev
Is there a switch to turn on the display of specific OSD issues?  Or
does the below indicate a generic problem, e.g. the network, and not any
specific OSD?

2018-02-28 18:09:36.438300 7f6dead56700  0
mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
total 15997 MB, used 6154 MB, avail 9008 MB
2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
[WRN] : Health check failed: 73 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
[WRN] : Health check update: 74 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:53.794882 7f6de8551700  0
mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
"status", "format": "json"} v 0) v1

--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Delete a Pool - how hard should be?

2018-03-01 Thread Max Cuttins
I think this is a good question for everybody: how hard should it be to delete 
a pool?


We ask the admin to type the pool name twice.
We ask them to add "--yes-i-really-really-mean-it".
We ask them to give the mons the ability to delete the pool (and to remove this 
ability ASAP afterwards).


... and then somebody, of course, asks us to restore the pool.
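
Just to spell out what the current procedure looks like (taken from the 
"Cannot delete a pool" thread on this list):

   ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
   ceph osd pool rm POOL_NAME POOL_NAME --yes-i-really-really-mean-it
   ceph tell mon.\* injectargs '--mon-allow-pool-delete=false'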

I think that all this stuff is not looking in the right direction.
It's not the administrator who needs to be warned about deleting data.
It's the data owner who should be warned (and who, most of the time, gives 
their approval by phone and is gone).


So, all this stuff just makes the administrator's life harder, while not 
improving the data owner's life in any way.
Probably the best solution is to ...not delete at all and instead 
apply a "deletion policy".

Something like:

   ceph osd pool rm POOL_NAME -yes
   -> POOL_NAME is set to be deleted, removal is scheduled within 30 days.


This allows us to do 2 things:

 * allow the administrator not to waste their time in the CLI with truly
   strange commands
 * allow the data owner a grace period to verify that, after
   deletion, everything works as expected and that the data that disappeared
   wasn't useful in some way.

After 30 days the data will be removed automatically. This is a safe policy 
for the ADMIN and the DATA OWNER.
Of course, the ADMIN should be allowed to remove a POOL scheduled for 
deletion in order to save disk space if needed (but only if needed).


What do you think?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete a pool

2018-03-01 Thread Max Cuttins

... and now it worked.
Maybe a typo in my first command.

Sorry


On 01/03/2018 17:28, David Turner wrote:
When dealing with the admin socket you need to be an admin.  `sudo` or 
`sudo -u ceph` ought to get you around that.


I was able to delete a pool just by using the injectargs that you 
showed above.


ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
ceph osd pool rm pool_name pool_name --yes-i-really-really-mean-it
ceph tell mon.\* injectargs '--mon-allow-pool-delete=false'

If you see the warning 'not observed, change may require restart' you 
can check to see if it took effect or not by asking the daemon what 
it's setting is `ceph daemon mon.ceph_node1 config get 
mon_allow_pool_delete`.


On Thu, Mar 1, 2018 at 10:41 AM Max Cuttins wrote:


I get:

#ceph daemon mon.0 config set mon_allow_pool_delete true
admin_socket: exception getting command descriptions: [Errno 13]
Permission denied


On 01/03/2018 14:00, Eugen Block wrote:
> It's not necessary to restart a mon if you just want to delete a
pool,
> even if the "not observed" message appears. And I would not
recommend
> to permanently enable the "easy" way of deleting a pool. If you are
> not able to delete the pool after "ceph tell mon ..." try this:
>
> ceph daemon mon. config set mon_allow_pool_delete true
>
> and then retry deleting the pool. This works for me without
restarting
> any services or changing config files.
>
> Regards
>
>
> Quoting Ronny Aasen <ronny+ceph-us...@aasen.cx>:
>
>> On 01. mars 2018 13:04, Max Cuttins wrote:
>>> I was testing IO and I created a bench pool.
>>>
>>> But if I tried to delete I get:
>>>
>>>    Error EPERM: pool deletion is disabled; you must first set the
>>>    mon_allow_pool_delete config option to true before you can
destroy a
>>>    pool
>>>
>>> So I run:
>>>
>>>    ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
>>>    mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not
>>>    observed, change may require restart)
>>>    mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not
>>>    observed, change may require restart)
>>>    mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not
>>>    observed, change may require restart)
>>>
>>> I restarted all the nodes.
>>> But the flag has not been observed.
>>>
>>> Is this the right way to remove a pool?
>>
>> i think you need to set the option in the ceph.conf of the
monitors.
>> and then restart the mon's one by one.
>>
>> afaik that is by design.
>>

https://blog.widodh.nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/
>>
>>
>> kind regards
>> Ronny Aasen
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete a pool

2018-03-01 Thread David Turner
When dealing with the admin socket you need to be an admin.  `sudo` or
`sudo -u ceph` ought to get you around that.

I was able to delete a pool just by using the injectargs that you showed
above.

ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
ceph osd pool rm pool_name pool_name --yes-i-really-really-mean-it
ceph tell mon.\* injectargs '--mon-allow-pool-delete=false'

If you see the warning 'not observed, change may require restart' you can
check to see if it took effect or not by asking the daemon what it's
setting is `ceph daemon mon.ceph_node1 config get mon_allow_pool_delete`.
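
If you would rather make it permanent in ceph.conf on the monitors (as Ronny
suggests below), the equivalent would be something like this, followed by
restarting the mons one by one:

   [mon]
   mon_allow_pool_delete = true

Personally I prefer injecting it only for the duration of the delete.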

On Thu, Mar 1, 2018 at 10:41 AM Max Cuttins  wrote:

> I get:
>
> #ceph daemon mon.0 config set mon_allow_pool_delete true
> admin_socket: exception getting command descriptions: [Errno 13]
> Permission denied
>
>
> Il 01/03/2018 14:00, Eugen Block ha scritto:
> > It's not necessary to restart a mon if you just want to delete a pool,
> > even if the "not observed" message appears. And I would not recommend
> > to permanently enable the "easy" way of deleting a pool. If you are
> > not able to delete the pool after "ceph tell mon ..." try this:
> >
> > ceph daemon mon. config set mon_allow_pool_delete true
> >
> > and then retry deleting the pool. This works for me without restarting
> > any services or changing config files.
> >
> > Regards
> >
> >
> > Zitat von Ronny Aasen :
> >
> >> On 01. mars 2018 13:04, Max Cuttins wrote:
> >>> I was testing IO and I created a bench pool.
> >>>
> >>> But if I tried to delete I get:
> >>>
> >>>Error EPERM: pool deletion is disabled; you must first set the
> >>>mon_allow_pool_delete config option to true before you can destroy a
> >>>pool
> >>>
> >>> So I run:
> >>>
> >>>ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
> >>>mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not
> >>>observed, change may require restart)
> >>>mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not
> >>>observed, change may require restart)
> >>>mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not
> >>>observed, change may require restart)
> >>>
> >>> I restarted all the nodes.
> >>> But the flag has not been observed.
> >>>
> >>> Is this the right way to remove a pool?
> >>
> >> i think you need to set the option in the ceph.conf of the monitors.
> >> and then restart the mon's one by one.
> >>
> >> afaik that is by design.
> >>
> https://blog.widodh.nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/
> >>
> >>
> >> kind regards
> >> Ronny Aasen
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete a pool

2018-03-01 Thread Jean-Charles Lopez
Hi,

connect to the ceph-node1 machine and run : ceph daemon mon.ceph-node1 config 
set mon_allow_pool_delete true

You are just using the wrong parameter as an ID

JC

> On Mar 1, 2018, at 07:41, Max Cuttins  wrote:
> 
> I get:
> 
> #ceph daemon mon.0 config set mon_allow_pool_delete true
> admin_socket: exception getting command descriptions: [Errno 13] Permission 
> denied
> 
> 
> Il 01/03/2018 14:00, Eugen Block ha scritto:
>> It's not necessary to restart a mon if you just want to delete a pool, even 
>> if the "not observed" message appears. And I would not recommend to 
>> permanently enable the "easy" way of deleting a pool. If you are not able to 
>> delete the pool after "ceph tell mon ..." try this:
>> 
>> ceph daemon mon. config set mon_allow_pool_delete true
>> 
>> and then retry deleting the pool. This works for me without restarting any 
>> services or changing config files.
>> 
>> Regards
>> 
>> 
>> Zitat von Ronny Aasen :
>> 
>>> On 01. mars 2018 13:04, Max Cuttins wrote:
 I was testing IO and I created a bench pool.
 
 But if I tried to delete I get:
 
Error EPERM: pool deletion is disabled; you must first set the
mon_allow_pool_delete config option to true before you can destroy a
pool
 
 So I run:
 
ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not
observed, change may require restart)
mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not
observed, change may require restart)
mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not
observed, change may require restart)
 
 I restarted all the nodes.
 But the flag has not been observed.
 
 Is this the right way to remove a pool?
>>> 
>>> i think you need to set the option in the ceph.conf of the monitors.
>>> and then restart the mon's one by one.
>>> 
>>> afaik that is by design.
>>> https://blog.widodh.nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/
>>>  
>>> 
>>> kind regards
>>> Ronny Aasen
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Case where a separate Bluestore WAL/DB device crashes...

2018-03-01 Thread Jonathan Proulx
On Thu, Mar 01, 2018 at 04:57:59PM +0100, Hervé Ballans wrote:

:Can we find recent benchmarks on this performance issue related to the
:location of WAL/DBs ?

I don't have benchmarks but I have some anecdotes.

we previously had 4T NLSAS (7.2k) filestore data drives with journals
on SSD (5:1 ssd:spinner).  We had unpleasant latency, and at about 60%
space utilization we were at 80%+ IOPS utilization.

We decided to go with smaller 2T but still slow 7.2k NLSAS drives for the
next expansion, to spread IOPS over more but still cheap spindles.
This coincided with bluestore going official in luminous, so we did not
spec SSDs.

This worked out fairly well; the 2T drives had similar but slightly
lower IOPS utilization and dramatically improved latency.

Based on this we decided to do rolling conversions of older 4T servers
to bluestore (they were already luminous), removing the SSD layer with
an eye to making a performance pool out of them later.

This went poorly. Latency improved to the same extent we saw on the newer
2T drives, but IOPS frequently flatlined at 100% during deep scrubs,
resulting in slow requests, blocked PGs and very sad VMs on top of it
all.

We went back and re-formatted the OSDs to use bluestore with the DB on
SSD.  This kept the improved latency characteristics and dropped IOPS
on the spinning disks back to about the same as (maybe slightly less than)
filestore, so not great but acceptable.
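
For reference, the reformat itself was just the stock bluestore-with-db
layout, something along these lines per OSD (device names are examples; the
partitioning of the shared SSD is scripted separately):

   ceph-volume lvm zap /dev/sdc
   ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3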

Much of this suffering is due to our budgetary requirements being clearer
than our performance requirements.  But at least for slow spinners the
SSD can make a big impact; presumably, if we had faster disks, the SSD would
have a more marginal effect.

-Jon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Case where a separate Bluestore WAL/DB device crashes...

2018-03-01 Thread Hervé Ballans

Indeed it makes sense, thanks !

And so, just for my own thinking, for the implementation of a new 
Bluestore project, we really have to ask ourselves the question of 
whether separating WAL/DBs significantly increases performance. If the 
WAL/DB are on the same device as the bluestore data device, we only lose 
one OSD at a time...


Can we find recent benchmarks on this performance issue related to the 
location of WAL/DBs ?


Thanks,
Hervé

On 01/03/2018 at 16:31, David Turner wrote:
This aspect of osds has not changed from filestore with SSD journals 
to bluestore with DB and WAL soon SSDs. If the SSD fails, all osds 
using it aren't lost and need to be removed from the cluster and 
recreated with a new drive.


You can never guarantee data integrity on bluestore or filestore if 
any media of the osd fails completely.


On Thu, Mar 1, 2018, 10:24 AM Hervé Ballans <herve.ball...@ias.u-psud.fr> wrote:


Hello,

With Bluestore, I have a couple of questions regarding the case of
separate partitions for block.wal and block.db.

Let's take the case of an OSD node that contains several OSDs
(HDDs) and
also contains one SSD drive for storing WAL partitions and an another
one for storing DB partitions. In this configuration, from my
understanding (but I may be wrong), each SSD drive appears as a
SPOF for
the entire node.

For example, what happens if one of the 2 SSD drives crashes (I know,
it's very rare but...) ?

In this case, are the bluestore data on all the OSDs of the same node
also lost ?

I guess so, but as a result, what is the recovery scenario ? Will
it be
necessary to entirely recreate the node (OSDs + block.wal +
block.db) to
rebuild all the replicas from the other nodes on it ?

Thanks in advance,
Hervé

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Milanov, Radoslav Nikiforov
Probably priorities have changed since Red Hat acquired Ceph/Inktank ( 
https://www.redhat.com/en/about/press-releases/red-hat-acquire-inktank-provider-ceph
 )?
Why support a competing hypervisor? Long term, switching to KVM seems to be the 
solution.

- Rado

From: ceph-users  On Behalf Of Max Cuttins
Sent: Thursday, March 1, 2018 7:27 AM
To: David Turner ; dilla...@redhat.com
Cc: ceph-users 
Subject: Re: [ceph-users] Ceph iSCSI is a prank?

Il 28/02/2018 18:16, David Turner ha scritto:

My thought is that in 4 years you could have migrated to a hypervisor that will 
have better performance into ceph than an added iSCSI layer. I won't deploy VMs 
for ceph on anything that won't allow librbd to work. Anything else is added 
complexity and reduced performance.

You are definitely right: I have to change hypervisor. So why didn't I do this 
before?
Because both Citrix/Xen and Inktank/Ceph claimed they were ready to add 
support for Xen in 2013!

It was 2013:
Xen claimed to support Ceph: 
https://www.citrix.com/blogs/2013/07/08/xenserver-tech-preview-incorporating-ceph-object-stores-is-now-available/
Inktank said the support for Xen was almost ready: 
https://ceph.com/geen-categorie/xenserver-support-for-rbd/

And also iSCSI was close (it was 2014):
https://ceph.com/geen-categorie/updates-to-ceph-tgt-iscsi-support/

So why change hypervisor if everybody tells you that compatibility is almost 
ready to be deployed?
... but then 4 years "just" passed, and Xen and Ceph never became 
compatible...

It's obvious that Citrix is not believable anymore.
However, at least Ceph should have added iSCSI to its platform during all 
these years.
Ceph is awesome, so why not just kill all the competitors and make it compatible 
even with a washing machine?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete a pool

2018-03-01 Thread Max Cuttins

I get:

#ceph daemon mon.0 config set mon_allow_pool_delete true
admin_socket: exception getting command descriptions: [Errno 13] 
Permission denied



On 01/03/2018 14:00, Eugen Block wrote:
It's not necessary to restart a mon if you just want to delete a pool, 
even if the "not observed" message appears. And I would not recommend 
to permanently enable the "easy" way of deleting a pool. If you are 
not able to delete the pool after "ceph tell mon ..." try this:


ceph daemon mon. config set mon_allow_pool_delete true

and then retry deleting the pool. This works for me without restarting 
any services or changing config files.


Regards


Quoting Ronny Aasen :


On 01. mars 2018 13:04, Max Cuttins wrote:

I was testing IO and I created a bench pool.

But if I tried to delete I get:

   Error EPERM: pool deletion is disabled; you must first set the
   mon_allow_pool_delete config option to true before you can destroy a
   pool

So I run:

   ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
   mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)
   mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)
   mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)

I restarted all the nodes.
But the flag has not been observed.

Is this the right way to remove a pool?


i think you need to set the option in the ceph.conf of the monitors.
and then restart the mon's one by one.

afaik that is by design.
https://blog.widodh.nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/ 



kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Max Cuttins

Almost...


On 01/03/2018 16:17, Heðin Ejdesgaard Møller wrote:

Hello,

I would like to point out that we are running ceph+redundant iscsiGW's,
connecting the LUN's to a esxi+vcsa-6.5 cluster with Red Hat support.

We did encounter a few bumps on the road to production, but those got
fixed by Red Hat engineering and are included in the RHEL 7.5 and 4.17
kernels.

I can recommend having a look at https://github.com/open-iscsi if you
want to contribute on the userspace side.

Regards
Heðin Ejdesgaard
Synack Sp/f

Direct: +298 77 11 12
Phone:  +298 20 11 11
E-Mail: h...@synack.fo


On hós, 2018-03-01 at 13:33 +0100, Kai Wagner wrote:

I totally understand and see your frustration here, but you have to keep
in mind that this is an Open Source project with lots of volunteers.
If you have a really urgent need, you have the option to develop
such a feature on your own, or to pay someone to do the work for you.

It's a long journey but it seems like it finally comes to an end.


On 03/01/2018 01:26 PM, Max Cuttins wrote:

It's obvious that Citrix in not anymore belivable.
However, at least Ceph should have added iSCSI to it's platform
during
all these years.
Ceph is awesome, so why just don't kill all the competitors make it
compatible even with washingmachine?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow clients after git pull

2018-03-01 Thread Daniel Carrasco
Hello,

Some data is not in the git repository and also needs to be updated on all
servers at the same time (uploads...); that's why I'm searching for a
centralized solution.

I think I've found a "patch" to do it... All our servers are connected to a
manager, so I've created a task in that manager to stop nginx, umount the
FS, remount the FS and then start nginx when the git repository is
deployed. It looks like it works as expected, and with the cache I'm planning
to add to the webpage the impact should be minimal.
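
In case it is useful to anyone, the task boils down to something like this on
each web node (paths and unit names as we use them; adapt as needed):

   #!/bin/sh
   # run right after the git deploy finishes
   systemctl stop nginx
   umount /mnt/ceph
   mount -a             # /mnt/ceph is in fstab, so this remounts it
   systemctl start nginx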

Thanks and greetings!!

2018-03-01 16:28 GMT+01:00 David Turner :

> This removes ceph completely, or any other networked storage, but git has
> triggers. If your website is stopped in git and you just need to make sure
> that nginx always has access to the latest data, just configure git
> triggers to auto-update the repository when there is a commit to the
> repository from elsewhere. This would be on local storage and remove a lot
> of complexity. All front-end servers would update automatically via git.
>
> If something like that doesn't work, it would seem you have a workaround
> that works for you.
>
>
> On Thu, Mar 1, 2018, 10:12 AM Daniel Carrasco 
> wrote:
>
>> Hello,
>>
>> Our problem is that the webpage is on a autoscaling group, so the created
>> machine is not always updated and needs to have the latest data always.
>> I've tried several ways to do it:
>>
>>- Local Storage synced: Sometimes the sync fails and data is not
>>updated
>>- NFS: If NFS server goes down, all clients die
>>- Two NFS Server synced+Monit: when a NFS server is down umount
>>freezes and is not able to change to the other NFS server
>>- GlusterFS: Too slow for webpages
>>
>> CephFS is near to NFS on speed and have auto recovery if one node goes
>> down (clients connects to other MDS automatically).
>>
>> About to use RBD, my problem is that I need a FS, because Nginx is not
>> able to read directly from Ceph in other ways.
>> About S3 and similar, I've also tried AWS NFS method but is much slower
>> (even more than GlusterFS).
>>
>> My problem is that CephFS fits what I need.
>>
>> Doing tests I've noticed that maybe the file is updated on ceph node
>> while client has file sessions open, so until I remount the FS that
>> sessions continue opened. When I open the files with vim I notice that is a
>> bit slower while is updating the repository, but after the update it works
>> as fast as before.
>>
>> It fails even on Jewel so I think that maybe the only way to do it is to
>> create a task to remount the FS when I deploy.
>>
>> Greetings and thanks!!
>>
>>
>> 2018-03-01 15:29 GMT+01:00 David Turner :
>>
>>> Using CephFS for something like this is about the last thing I would
>>> do.  Does it need to be on a networked posix filesystem that can be mounted
>>> on multiple machines at the same time?  If so, then you're kinda stuck and
>>> we can start looking at your MDS hardware and see if there are any MDS
>>> settings that need to be configured differently for this to work.
>>>
>>> If you don't NEED CephFS, then I would recommend utilizing an RBD for
>>> something like this.  Its limitation is only being able to be mapped to 1
>>> server at a time, but that's decent enough for most failover scenarios for
>>> build setups.  If you need to failover, unmap it from the primary and map
>>> it to another server to resume workloads.
>>>
>>> Hosting websites out of CephFS also seems counter-intuitive.  Have you
>>> looked at S3 websites?  RGW supports configuring websites out of a bucket
>>> that might be of interest.  Your RGW daemon configuration could easily
>>> become an HA website with an LB in front of them.
>>>
>>> I'm biased here a bit, but I don't like to use networked filesystems
>>> unless nothing else can be worked out or the software using it is 3rd party
>>> and just doesn't support anything else.
>>>
>>> On Thu, Mar 1, 2018 at 9:05 AM Daniel Carrasco 
>>> wrote:
>>>
 Hello,

 I've tried to change a lot of things on configuration and use ceph-fuse
 but nothing makes it work better... When I deploy the git repository it
 becomes much slower until I remount the FS (just executing systemctl stop
 nginx && umount /mnt/ceph && mount -a && systemctl start nginx). It happen
 when the FS gets a lot of IO because when I execute Rsync I got the same
 problem.

 I'm thinking about to downgrade to a lower version of ceph like for
 example jewel to see if works better. I know that will be deprecated soon,
 but I don't know what other tests I can do...

 Greetings!!

 2018-02-28 17:11 GMT+01:00 Daniel Carrasco :

> Hello,
>
> I've created a Ceph cluster with 3 nodes and a FS to serve a webpage.
> The webpage speed is good enough (near to NFS speed), and have HA if one 
> FS
> die.
> My problem comes when I deploy a git repository on that FS. The server
> makes a lot of IOPS to check the files that have to update

Re: [ceph-users] Case where a separate Bluestore WAL/DB device crashes...

2018-03-01 Thread Caspar Smit
s/aren't/are/  :)



Met vriendelijke groet,

Caspar Smit
Systemengineer
SuperNAS
Dorsvlegelstraat 13
1445 PA Purmerend

t: (+31) 299 410 414
e: caspars...@supernas.eu
w: www.supernas.eu

2018-03-01 16:31 GMT+01:00 David Turner :

> This aspect of osds has not changed from filestore with SSD journals to
> bluestore with DB and WAL soon SSDs. If the SSD fails, all osds using it
> aren't lost and need to be removed from the cluster and recreated with a
> new drive.
>
> You can never guarantee data integrity on bluestore or filestore if any
> media of the osd fails completely.
>
>
> On Thu, Mar 1, 2018, 10:24 AM Hervé Ballans 
> wrote:
>
>> Hello,
>>
>> With Bluestore, I have a couple of questions regarding the case of
>> separate partitions for block.wal and block.db.
>>
>> Let's take the case of an OSD node that contains several OSDs (HDDs) and
>> also contains one SSD drive for storing WAL partitions and an another
>> one for storing DB partitions. In this configuration, from my
>> understanding (but I may be wrong), each SSD drive appears as a SPOF for
>> the entire node.
>>
>> For example, what happens if one of the 2 SSD drives crashes (I know,
>> it's very rare but...) ?
>>
>> In this case, are the bluestore data on all the OSDs of the same node
>> also lost ?
>>
>> I guess so, but as a result, what is the recovery scenario ? Will it be
>> necessary to entirely recreate the node (OSDs + block.wal + block.db) to
>> rebuild all the replicas from the other nodes on it ?
>>
>> Thanks in advance,
>> Hervé
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Case where a separate Bluestore WAL/DB device crashes...

2018-03-01 Thread David Turner
This aspect of osds has not changed from filestore with SSD journals to
bluestore with DB and WAL on SSDs. If the SSD fails, all osds using it
aren't lost and need to be removed from the cluster and recreated with a
new drive.

You can never guarantee data integrity on bluestore or filestore if any
media of the osd fails completely.

On Thu, Mar 1, 2018, 10:24 AM Hervé Ballans 
wrote:

> Hello,
>
> With Bluestore, I have a couple of questions regarding the case of
> separate partitions for block.wal and block.db.
>
> Let's take the case of an OSD node that contains several OSDs (HDDs) and
> also contains one SSD drive for storing WAL partitions and an another
> one for storing DB partitions. In this configuration, from my
> understanding (but I may be wrong), each SSD drive appears as a SPOF for
> the entire node.
>
> For example, what happens if one of the 2 SSD drives crashes (I know,
> it's very rare but...) ?
>
> In this case, are the bluestore data on all the OSDs of the same node
> also lost ?
>
> I guess so, but as a result, what is the recovery scenario ? Will it be
> necessary to entirely recreate the node (OSDs + block.wal + block.db) to
> rebuild all the replicas from the other nodes on it ?
>
> Thanks in advance,
> Hervé
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow clients after git pull

2018-03-01 Thread David Turner
This removes Ceph, or any other networked storage, from the picture
entirely: git has triggers. If your website is stored in git and you just
need to make sure that nginx always has access to the latest data, just
configure git triggers to auto-update the repository when there is a commit
to the repository from elsewhere. This would be on local storage and remove
a lot of complexity. All front-end servers would update automatically via
git (a rough sketch of the idea follows below).
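
A minimal sketch of that idea, assuming the web root on each front-end is a
plain git clone (the path and schedule are only examples): a cron entry such
as

   * * * * * git -C /var/www/site pull --ff-only >/dev/null 2>&1

or a deploy-time webhook running the same pull would keep every front-end
current without any shared filesystem.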

If something like that doesn't work, it would seem you have a workaround
that works for you.

On Thu, Mar 1, 2018, 10:12 AM Daniel Carrasco  wrote:

> Hello,
>
> Our problem is that the webpage is on a autoscaling group, so the created
> machine is not always updated and needs to have the latest data always.
> I've tried several ways to do it:
>
>- Local Storage synced: Sometimes the sync fails and data is not
>updated
>- NFS: If NFS server goes down, all clients die
>- Two NFS Server synced+Monit: when a NFS server is down umount
>freezes and is not able to change to the other NFS server
>- GlusterFS: Too slow for webpages
>
> CephFS is near to NFS on speed and have auto recovery if one node goes
> down (clients connects to other MDS automatically).
>
> About to use RBD, my problem is that I need a FS, because Nginx is not
> able to read directly from Ceph in other ways.
> About S3 and similar, I've also tried AWS NFS method but is much slower
> (even more than GlusterFS).
>
> My problem is that CephFS fits what I need.
>
> Doing tests I've noticed that maybe the file is updated on ceph node while
> client has file sessions open, so until I remount the FS that sessions
> continue opened. When I open the files with vim I notice that is a bit
> slower while is updating the repository, but after the update it works as
> fast as before.
>
> It fails even on Jewel so I think that maybe the only way to do it is to
> create a task to remount the FS when I deploy.
>
> Greetings and thanks!!
>
>
> 2018-03-01 15:29 GMT+01:00 David Turner :
>
>> Using CephFS for something like this is about the last thing I would do.
>> Does it need to be on a networked posix filesystem that can be mounted on
>> multiple machines at the same time?  If so, then you're kinda stuck and we
>> can start looking at your MDS hardware and see if there are any MDS
>> settings that need to be configured differently for this to work.
>>
>> If you don't NEED CephFS, then I would recommend utilizing an RBD for
>> something like this.  Its limitation is only being able to be mapped to 1
>> server at a time, but that's decent enough for most failover scenarios for
>> build setups.  If you need to failover, unmap it from the primary and map
>> it to another server to resume workloads.
>>
>> Hosting websites out of CephFS also seems counter-intuitive.  Have you
>> looked at S3 websites?  RGW supports configuring websites out of a bucket
>> that might be of interest.  Your RGW daemon configuration could easily
>> become an HA website with an LB in front of them.
>>
>> I'm biased here a bit, but I don't like to use networked filesystems
>> unless nothing else can be worked out or the software using it is 3rd party
>> and just doesn't support anything else.
>>
>> On Thu, Mar 1, 2018 at 9:05 AM Daniel Carrasco 
>> wrote:
>>
>>> Hello,
>>>
>>> I've tried to change a lot of things on configuration and use ceph-fuse
>>> but nothing makes it work better... When I deploy the git repository it
>>> becomes much slower until I remount the FS (just executing systemctl stop
>>> nginx && umount /mnt/ceph && mount -a && systemctl start nginx). It happen
>>> when the FS gets a lot of IO because when I execute Rsync I got the same
>>> problem.
>>>
>>> I'm thinking about to downgrade to a lower version of ceph like for
>>> example jewel to see if works better. I know that will be deprecated soon,
>>> but I don't know what other tests I can do...
>>>
>>> Greetings!!
>>>
>>> 2018-02-28 17:11 GMT+01:00 Daniel Carrasco :
>>>
 Hello,

 I've created a Ceph cluster with 3 nodes and a FS to serve a webpage.
 The webpage speed is good enough (near to NFS speed), and have HA if one FS
 die.
 My problem comes when I deploy a git repository on that FS. The server
 makes a lot of IOPS to check the files that have to update and then all
 clients starts to have problems to use the FS (it becomes much slower).
 In a normal usage the web takes about 400ms to load, and when the
 problem start it takes more than 3s. To fix the problem I just have to
 remount the FS on clients, but I can't remount the FS on every deploy...

 While is deploying I see how the CPU on MDS is a bit higher, but when
 it ends the CPU usage goes down again, so look like is not a problem of 
 CPU.

 My config file is:
 [global]
 fsid = bf56854..e611c08
 mon_initial_members = fs-01, fs-02, fs-03
 mon_host = 10.50.0.94,10.50.1.216,10.50.2.52
 auth_cluster_required = cephx
>>

[ceph-users] Case where a separate Bluestore WAL/DB device crashes...

2018-03-01 Thread Hervé Ballans

Hello,

With Bluestore, I have a couple of questions regarding the case of 
separate partitions for block.wal and block.db.


Let's take the case of an OSD node that contains several OSDs (HDDs) and 
also contains one SSD drive for storing WAL partitions and another
one for storing DB partitions. In this configuration, from my 
understanding (but I may be wrong), each SSD drive appears as a SPOF for 
the entire node.


For example, what happens if one of the 2 SSD drives crashes (I know, 
it's very rare but...) ?


In this case, are the bluestore data on all the OSDs of the same node 
also lost ?


I guess so, but as a result, what is the recovery scenario ? Will it be 
necessary to entirely recreate the node (OSDs + block.wal + block.db) to 
rebuild all the replicas from the other nodes on it ?


Thanks in advance,
Hervé

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Heðin Ejdesgaard Møller
Hello,

I would like to point out that we are running ceph+redundant iscsiGW's,
connecting the LUN's to an esxi+vcsa-6.5 cluster with Red Hat support.

We did encounter a few bumps on the road to production, but those got
fixed by Red Hat engineering and are included in the rhel7.5 and 4.17
kernels.

I can recommend having a look at https://github.com/open-iscsi if you
want to contribute on the userspace side.

Regards
Heðin Ejdesgaard
Synack Sp/f 

Direct: +298 77 11 12
Phone:  +298 20 11 11
E-Mail: h...@synack.fo


On hós, 2018-03-01 at 13:33 +0100, Kai Wagner wrote:
> I totally understand and see your frustration here, but you've to
> keep
> in mind that this is an Open Source project with a lots of
> volunteers.
> If you have a really urgent need, you have the possibility to develop
> such a feature on your own or you've to buy someone who could do the
> work for you.
> 
> It's a long journey but it seems like it finally comes to an end.
> 
> 
> On 03/01/2018 01:26 PM, Max Cuttins wrote:
> > It's obvious that Citrix in not anymore belivable.
> > However, at least Ceph should have added iSCSI to it's platform
> > during
> > all these years.
> > Ceph is awesome, so why just don't kill all the competitors make it
> > compatible even with washingmachine?
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

signature.asc
Description: This is a digitally signed message part
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow clients after git pull

2018-03-01 Thread Daniel Carrasco
Hello,

Our problem is that the webpage is on an autoscaling group, so a newly
created machine is not always updated and it always needs to have the
latest data.
I've tried several ways to do it:

   - Local Storage synced: Sometimes the sync fails and data is not updated
   - NFS: If NFS server goes down, all clients die
   - Two NFS Server synced+Monit: when a NFS server is down umount freezes
   and is not able to change to the other NFS server
   - GlusterFS: Too slow for webpages

CephFS is close to NFS in speed and has auto recovery if one node goes down
(clients connect to another MDS automatically).

About using RBD, my problem is that I need a FS, because Nginx is not able
to read directly from Ceph in any other way.
About S3 and similar, I've also tried the AWS NFS method but it is much
slower (even slower than GlusterFS).

My problem is that CephFS fits what I need.

Doing tests I've noticed that maybe the files are updated on the ceph node
while clients have file sessions open, so until I remount the FS those
sessions stay open. When I open the files with vim I notice it is a bit
slower while the repository is updating, but after the update it works as
fast as before.

It fails even on Jewel, so I think that maybe the only fix is to create a
task that remounts the FS when I deploy.

Greetings and thanks!!


2018-03-01 15:29 GMT+01:00 David Turner :

> Using CephFS for something like this is about the last thing I would do.
> Does it need to be on a networked posix filesystem that can be mounted on
> multiple machines at the same time?  If so, then you're kinda stuck and we
> can start looking at your MDS hardware and see if there are any MDS
> settings that need to be configured differently for this to work.
>
> If you don't NEED CephFS, then I would recommend utilizing an RBD for
> something like this.  Its limitation is only being able to be mapped to 1
> server at a time, but that's decent enough for most failover scenarios for
> build setups.  If you need to failover, unmap it from the primary and map
> it to another server to resume workloads.
>
> Hosting websites out of CephFS also seems counter-intuitive.  Have you
> looked at S3 websites?  RGW supports configuring websites out of a bucket
> that might be of interest.  Your RGW daemon configuration could easily
> become an HA website with an LB in front of them.
>
> I'm biased here a bit, but I don't like to use networked filesystems
> unless nothing else can be worked out or the software using it is 3rd party
> and just doesn't support anything else.
>
> On Thu, Mar 1, 2018 at 9:05 AM Daniel Carrasco 
> wrote:
>
>> Hello,
>>
>> I've tried to change a lot of things on configuration and use ceph-fuse
>> but nothing makes it work better... When I deploy the git repository it
>> becomes much slower until I remount the FS (just executing systemctl stop
>> nginx && umount /mnt/ceph && mount -a && systemctl start nginx). It happen
>> when the FS gets a lot of IO because when I execute Rsync I got the same
>> problem.
>>
>> I'm thinking about to downgrade to a lower version of ceph like for
>> example jewel to see if works better. I know that will be deprecated soon,
>> but I don't know what other tests I can do...
>>
>> Greetings!!
>>
>> 2018-02-28 17:11 GMT+01:00 Daniel Carrasco :
>>
>>> Hello,
>>>
>>> I've created a Ceph cluster with 3 nodes and a FS to serve a webpage.
>>> The webpage speed is good enough (near to NFS speed), and have HA if one FS
>>> die.
>>> My problem comes when I deploy a git repository on that FS. The server
>>> makes a lot of IOPS to check the files that have to update and then all
>>> clients starts to have problems to use the FS (it becomes much slower).
>>> In a normal usage the web takes about 400ms to load, and when the
>>> problem start it takes more than 3s. To fix the problem I just have to
>>> remount the FS on clients, but I can't remount the FS on every deploy...
>>>
>>> While is deploying I see how the CPU on MDS is a bit higher, but when it
>>> ends the CPU usage goes down again, so look like is not a problem of CPU.
>>>
>>> My config file is:
>>> [global]
>>> fsid = bf56854..e611c08
>>> mon_initial_members = fs-01, fs-02, fs-03
>>> mon_host = 10.50.0.94,10.50.1.216,10.50.2.52
>>> auth_cluster_required = cephx
>>> auth_service_required = cephx
>>> auth_client_required = cephx
>>>
>>> public network = 10.50.0.0/22
>>> osd pool default size = 3
>>>
>>> ##
>>> ### OSD
>>> ##
>>> [osd]
>>>   osd_mon_heartbeat_interval = 5
>>>   osd_mon_report_interval_max = 10
>>>   osd_heartbeat_grace = 15
>>>   osd_fast_fail_on_connection_refused = True
>>>   osd_pool_default_pg_num = 128
>>>   osd_pool_default_pgp_num = 128
>>>   osd_pool_default_size = 2
>>>   osd_pool_default_min_size = 2
>>>
>>> ##
>>> ### Monitores
>>> ##
>>> [mon]
>>>   mon_osd_min_down_reporters = 1
>>>
>>> ##
>>> ### MDS
>>> ##
>>> [mds]
>>>   mds_cache_memory_limit = 792723456
>>>   mds_bal_mode = 1
>>>
>>> ##
>>> ### Client
>>> ##
>>> [client]
>>>   client_ca

Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Donny Davis
I wonder when EMC/Netapp are going to start giving away production-ready
bits that fit into your architecture.

At least support for this feature is coming in the near term.

I say keep on keepin on. Kudos to the ceph team (and maybe more teams) for
taking care of the hard stuff for us.




On Thu, Mar 1, 2018 at 9:42 AM, Samuel Soulard 
wrote:

> Hi Jason,
>
> That's awesome.  Keep up the good work guys, we all love the work you are
> doing with that software!!
>
> Sam
>
> On Mar 1, 2018 09:11, "Jason Dillaman"  wrote:
>
>> It's very high on our priority list to get a solution merged in the
>> upstream kernel. There was a proposal to use DLM to distribute the PGR
>> state between target gateways (a la the SCST target) and it's quite
>> possible that would have the least amount of upstream resistance since
>> it would work for all backends and not just RBD. We, of course, would
>> love to just use the Ceph cluster to distribute the state information
>> instead of requiring a bolt-on DLM (with its STONITH error handling),
>> but we'll take what we can get (merged).
>>
>> I believe SUSE uses a custom downstream kernel that stores the PGR
>> state in the Ceph cluster but requires two round-trips to the cluster
>> for each IO (first to verify the PGR state and the second to perform
>> the IO). The PetaSAN project is built on top of these custom kernel
>> patches as well, I believe.
>>
>> On Thu, Mar 1, 2018 at 8:50 AM, Samuel Soulard 
>> wrote:
>> > On another note, is there any work being done for persistent group
>> > reservations support for Ceph/LIO compatibility? Or just a rough
>> estimate :)
>> >
>> > Would love to see Redhat/Ceph support this type of setup.  I know Suse
>> > supports it as of late.
>> >
>> > Sam
>> >
>> > On Mar 1, 2018 07:33, "Kai Wagner"  wrote:
>> >>
>> >> I totally understand and see your frustration here, but you've to keep
>> >> in mind that this is an Open Source project with a lots of volunteers.
>> >> If you have a really urgent need, you have the possibility to develop
>> >> such a feature on your own or you've to buy someone who could do the
>> >> work for you.
>> >>
>> >> It's a long journey but it seems like it finally comes to an end.
>> >>
>> >>
>> >> On 03/01/2018 01:26 PM, Max Cuttins wrote:
>> >> > It's obvious that Citrix in not anymore belivable.
>> >> > However, at least Ceph should have added iSCSI to it's platform
>> during
>> >> > all these years.
>> >> > Ceph is awesome, so why just don't kill all the competitors make it
>> >> > compatible even with washingmachine?
>> >>
>> >> --
>> >> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
>> HRB
>> >> 21284 (AG Nürnberg)
>> >>
>> >>
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>
>> --
>> Jason
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Samuel Soulard
Hi Jason,

That's awesome.  Keep up the good work guys, we all love the work you are
doing with that software!!

Sam

On Mar 1, 2018 09:11, "Jason Dillaman"  wrote:

> It's very high on our priority list to get a solution merged in the
> upstream kernel. There was a proposal to use DLM to distribute the PGR
> state between target gateways (a la the SCST target) and it's quite
> possible that would have the least amount of upstream resistance since
> it would work for all backends and not just RBD. We, of course, would
> love to just use the Ceph cluster to distribute the state information
> instead of requiring a bolt-on DLM (with its STONITH error handling),
> but we'll take what we can get (merged).
>
> I believe SUSE uses a custom downstream kernel that stores the PGR
> state in the Ceph cluster but requires two round-trips to the cluster
> for each IO (first to verify the PGR state and the second to perform
> the IO). The PetaSAN project is built on top of these custom kernel
> patches as well, I believe.
>
> On Thu, Mar 1, 2018 at 8:50 AM, Samuel Soulard 
> wrote:
> > On another note, is there any work being done for persistent group
> > reservations support for Ceph/LIO compatibility? Or just a rough
> estimate :)
> >
> > Would love to see Redhat/Ceph support this type of setup.  I know Suse
> > supports it as of late.
> >
> > Sam
> >
> > On Mar 1, 2018 07:33, "Kai Wagner"  wrote:
> >>
> >> I totally understand and see your frustration here, but you've to keep
> >> in mind that this is an Open Source project with a lots of volunteers.
> >> If you have a really urgent need, you have the possibility to develop
> >> such a feature on your own or you've to buy someone who could do the
> >> work for you.
> >>
> >> It's a long journey but it seems like it finally comes to an end.
> >>
> >>
> >> On 03/01/2018 01:26 PM, Max Cuttins wrote:
> >> > It's obvious that Citrix in not anymore belivable.
> >> > However, at least Ceph should have added iSCSI to it's platform during
> >> > all these years.
> >> > Ceph is awesome, so why just don't kill all the competitors make it
> >> > compatible even with washingmachine?
> >>
> >> --
> >> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB
> >> 21284 (AG Nürnberg)
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] force scrubbing

2018-03-01 Thread David Turner
They added `ceph pg force-backfill ` but there is no way to force
scrubbing yet aside from the previously mentioned tricks. You should be
able to change osd_max_scrubs around until the PGs you want to scrub are
going.
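
Something along these lines usually does it (the pg id is taken from your
example; 5 is just an arbitrary temporary value):

   ceph tell osd.* injectargs '--osd_max_scrubs 5'
   ceph pg deep-scrub 5.238
   ceph pg dump pgs | grep '^5.238'    # watch last_deep_scrub_stamp change
   ceph tell osd.* injectargs '--osd_max_scrubs 1'   # back to the default (1)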

On Thu, Mar 1, 2018 at 9:30 AM Kenneth Waegeman 
wrote:

> Hi,
>
> Still seeing this on Luminous 12.2.2:
>
> When I do ceph pg deep-scrub on the pg or ceph osd deep-scrub on the
> primary osd, I get the message
>
> instructing pg 5.238 on osd.356 to deep-scrub
>
> But nothing happens on that OSD. I waited a day, but the timestamp I see
> in ceph pg dump hasn't changed.
>
> Any clues?
>
> Thanks!!
>
> K
>
> On 13/11/17 10:01, Kenneth Waegeman wrote:
> > Hi all,
> >
> >
> > Is there a way to force scrub a pg of an erasure coded pool?
> >
> > I tried  ceph pg deep-scrub 5.4c7, but after a week it still hasn't
> > scrubbed the pg (last scrub timestamp not changed)
> >
> > Thanks!
> >
> >
> > Kenneth
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Ric Wheeler

On 02/28/2018 10:06 AM, Max Cuttins wrote:



Il 28/02/2018 15:19, Jason Dillaman ha scritto:

On Wed, Feb 28, 2018 at 7:53 AM, Massimiliano Cuttini  
wrote:

I was building ceph in order to use with iSCSI.
But I just see from the docs that need:

CentOS 7.5
(which is not available yet, it's still at 7.4)
https://wiki.centos.org/Download

Kernel 4.17
(which is not available yet, it is still at 4.15.7)
https://www.kernel.org/

The necessary kernel changes actually are included as part of 4.16-rc1
which is available now. We also offer a pre-built test kernel with the
necessary fixes here [1].

This is a release candidate and it's not ready for production.
Does anybody know when the kernel 4.16 will be ready for production?


Every user/customer has a different definition of "production" - most enterprise 
users will require their distribution vendor to have this prebuilt into a 
product with commercial support.


If you are looking at using brand new kernels in production for your definition 
of production without vendor support, you need to have the personal expertise 
and staffing required to validate production readiness and carry out support 
yourself.


As others have said, that is the joy of open source - you get to make that call 
on your own, but that does come at a price (spending money for vendor support or 
spending your time and expertise to do it on your own :))


Regards,

Ric







So I guess, there is no ufficial support and this is just a bad prank.

Ceph is ready to be used with S3 since many years.
But need the kernel of the next century to works with such an old technology
like iSCSI.
So sad.

Unfortunately, kernel vs userspace have very different development
timelines.We have no interest in maintaining out-of-tree patchsets to
the kernel.


This is true, but having something that just works in order to have minimum 
compatibility and start to dismiss old disk is something you should think about.
You'll have ages in order to improve and get better performance. But you 
should allow Users to cut-off old solutions as soon as possible while waiting 
for a better implementation.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread David Disseldorp
On Thu, 1 Mar 2018 09:11:21 -0500, Jason Dillaman wrote:

> It's very high on our priority list to get a solution merged in the
> upstream kernel. There was a proposal to use DLM to distribute the PGR
> state between target gateways (a la the SCST target) and it's quite
> possible that would have the least amount of upstream resistance since
> it would work for all backends and not just RBD. We, of course, would
> love to just use the Ceph cluster to distribute the state information
> instead of requiring a bolt-on DLM (with its STONITH error handling),
> but we'll take what we can get (merged).

I'm also very keen on having a proper upstream solution for this. My
preference is still to proceed with PR state backed by Ceph.

> I believe SUSE uses a custom downstream kernel that stores the PGR
> state in the Ceph cluster but requires two round-trips to the cluster
> for each IO (first to verify the PGR state and the second to perform
> the IO). The PetaSAN project is built on top of these custom kernel
> patches as well, I believe.

Maged from PetaSAN added support for rados-notify based PR state
retrieval. Still, in the end the PR patch-set is too intrusive to make it
upstream, so we need to work on a proper upstreamable solution, with
tcmu-runner or otherwise.

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] force scrubbing

2018-03-01 Thread Kenneth Waegeman

Hi,

Still seeing this on Luminous 12.2.2:

When I do ceph pg deep-scrub on the pg or ceph osd deep-scrub on the 
primary osd, I get the message


instructing pg 5.238 on osd.356 to deep-scrub

But nothing happens on that OSD. I waited a day, but the timestamp I see 
in ceph pg dump hasn't changed.


Any clues?

Thanks!!

K

On 13/11/17 10:01, Kenneth Waegeman wrote:

Hi all,


Is there a way to force scrub a pg of an erasure coded pool?

I tried  ceph pg deep-scrub 5.4c7, but after a week it still hasn't 
scrubbed the pg (last scrub timestamp not changed)


Thanks!


Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow clients after git pull

2018-03-01 Thread David Turner
Using CephFS for something like this is about the last thing I would do.
Does it need to be on a networked posix filesystem that can be mounted on
multiple machines at the same time?  If so, then you're kinda stuck and we
can start looking at your MDS hardware and see if there are any MDS
settings that need to be configured differently for this to work.

If you don't NEED CephFS, then I would recommend utilizing an RBD for
something like this.  Its limitation is only being able to be mapped to 1
server at a time, but that's decent enough for most failover scenarios for
build setups.  If you need to failover, unmap it from the primary and map
it to another server to resume workloads.
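
Roughly, the failover itself is just a handful of commands (pool, image and
mount point names below are only examples):

   # on the old primary, if it is still reachable
   umount /mnt/web && rbd unmap /dev/rbd0
   # on the standby server
   rbd map webpool/webdata && mount /dev/rbd0 /mnt/web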

Hosting websites out of CephFS also seems counter-intuitive.  Have you
looked at S3 websites?  RGW supports configuring websites out of a bucket
that might be of interest.  Your RGW daemon configuration could easily
become an HA website with an LB in front of them.

I'm biased here a bit, but I don't like to use networked filesystems unless
nothing else can be worked out or the software using it is 3rd party and
just doesn't support anything else.

On Thu, Mar 1, 2018 at 9:05 AM Daniel Carrasco  wrote:

> Hello,
>
> I've tried to change a lot of things on configuration and use ceph-fuse
> but nothing makes it work better... When I deploy the git repository it
> becomes much slower until I remount the FS (just executing systemctl stop
> nginx && umount /mnt/ceph && mount -a && systemctl start nginx). It happen
> when the FS gets a lot of IO because when I execute Rsync I got the same
> problem.
>
> I'm thinking about to downgrade to a lower version of ceph like for
> example jewel to see if works better. I know that will be deprecated soon,
> but I don't know what other tests I can do...
>
> Greetings!!
>
> 2018-02-28 17:11 GMT+01:00 Daniel Carrasco :
>
>> Hello,
>>
>> I've created a Ceph cluster with 3 nodes and a FS to serve a webpage. The
>> webpage speed is good enough (near to NFS speed), and have HA if one FS die.
>> My problem comes when I deploy a git repository on that FS. The server
>> makes a lot of IOPS to check the files that have to update and then all
>> clients starts to have problems to use the FS (it becomes much slower).
>> In a normal usage the web takes about 400ms to load, and when the problem
>> start it takes more than 3s. To fix the problem I just have to remount the
>> FS on clients, but I can't remount the FS on every deploy...
>>
>> While is deploying I see how the CPU on MDS is a bit higher, but when it
>> ends the CPU usage goes down again, so look like is not a problem of CPU.
>>
>> My config file is:
>> [global]
>> fsid = bf56854..e611c08
>> mon_initial_members = fs-01, fs-02, fs-03
>> mon_host = 10.50.0.94,10.50.1.216,10.50.2.52
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>>
>> public network = 10.50.0.0/22
>> osd pool default size = 3
>>
>> ##
>> ### OSD
>> ##
>> [osd]
>>   osd_mon_heartbeat_interval = 5
>>   osd_mon_report_interval_max = 10
>>   osd_heartbeat_grace = 15
>>   osd_fast_fail_on_connection_refused = True
>>   osd_pool_default_pg_num = 128
>>   osd_pool_default_pgp_num = 128
>>   osd_pool_default_size = 2
>>   osd_pool_default_min_size = 2
>>
>> ##
>> ### Monitores
>> ##
>> [mon]
>>   mon_osd_min_down_reporters = 1
>>
>> ##
>> ### MDS
>> ##
>> [mds]
>>   mds_cache_memory_limit = 792723456
>>   mds_bal_mode = 1
>>
>> ##
>> ### Client
>> ##
>> [client]
>>   client_cache_size = 32768
>>   client_mount_timeout = 30
>>   client_oc_max_objects = 2000
>>   client_oc_size = 629145600
>>   client_permissions = false
>>   rbd_cache = true
>>   rbd_cache_size = 671088640
>>
>> My cluster and clients uses Debian 9 with latest ceph version (12.2.4).
>> The clients uses kernel modules to mount the share, because are a bit
>> faster than fuse modules. The deploy is done on one of the Ceph nodes, that
>> have the FS mounted by kernel module too.
>> My cluster is not a high usage cluster, so have all daemons on one
>> machine (3 machines with OSD, MON, MGR and MDS). All OSD has a copy of the
>> data, only one MGR is active and two of the MDS are active with one on
>> standby. The clients mount the FS using the three MDS IP addresses and just
>> now don't have any request because is not published.
>>
>> Someone knows what can be happening?, because all works fine (even on
>> other cluster I did with an high load), but just deploy the git repository
>> and all start to work very slow.
>>
>> Thanks!!
>>
>>
>> --
>> _
>>
>>   Daniel Carrasco Marín
>>   Ingeniería para la Innovación i2TIC, S.L.
>>   Tlf:  +34 911 12 32 84 Ext: 223
>>   www.i2tic.com
>> _
>>
>
>
>
> --
> _
>
>   Daniel Carrasco Marín
>   Ingeniería para la Innovación i2TIC, S.L.
>   Tlf:  +34 911 12 32 84 Ext: 223
>   www.i2tic.com

Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Federico Lucifredi
Hi Max,

> On Feb 28, 2018, at 10:06 AM, Max Cuttins  wrote:
> 
> This is true, but having something that just works in order to have minimum 
> compatibility and start to dismiss old disk is something you should think 
> about.
> You'll have ages in order to improve and get better performance. But you 
> should allow Users to cut-off old solutions as soon as possible while waiting 
> for a better implementation.

I like your thinking, but I wonder why a locally-mounted kRBD volume doesn't
meet this need? It seems easier than iSCSI and I would venture it would show
twice the performance, at least in some cases.

iSCSI in ALUA mode may be as close as it gets to scale-out iSCSI in software.
It is not bad, but you pay for the extra hops in performance and complexity. So
it totally makes sense where kRBD and libRBD are not (yet) available, like
VMware and Windows, but not where native drivers are available.

And about Xen... patches are accepted in this project — folks who really care 
should go out and code it.

Best-F
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-01 Thread David Turner
With default memory settings, the general rule is 1GB ram/1TB OSD.  If you
have a 4TB OSD, you should plan to have at least 4GB ram.  This was the
recommendation for filestore OSDs, but it was a bit much memory for the
OSDs.  From what I've seen, this rule is a little more appropriate with
bluestore now and should still be observed.

Please note that memory usage in a HEALTH_OK cluster is not the same amount
of memory that the daemons will use during recovery.  I have seen
deployments with 4x memory usage during recovery.
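
As a back-of-the-envelope example: a node with 10 x 4TB bluestore OSDs should
be planned with at least ~40GB of RAM for the OSDs alone, plus headroom for
recovery. If you want to pin the caches explicitly, the luminous options look
roughly like this (values in bytes; these are examples, not recommendations):

   [osd]
   bluestore_cache_size_hdd = 1073741824   # 1GB per HDD OSD
   bluestore_cache_size_ssd = 3221225472   # 3GB per SSD OSD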

On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman  wrote:

> Quoting Caspar Smit (caspars...@supernas.eu):
> > Stefan,
> >
> > How many OSD's and how much RAM are in each server?
>
> Currently 7 OSDs, 128 GB RAM. Max wil be 10 OSDs in these servers. 12
> cores (at least one core per OSD).
>
> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM
> right?
>
> Apparently. Sure they will use more RAM than just cache to function
> correctly. I figured 3 GB per OSD would be enough ...
>
> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of total
> > RAM. The cache is a part of the memory usage by bluestore OSD's.
>
> A factor 4 is quite high, isn't it? Where is all this RAM used for
> besides cache? RocksDB?
>
> So how should I size the amount of RAM in a OSD server for 10 bluestore
> SSDs in a
> replicated setup?
>
> Thanks,
>
> Stefan
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688
> <+31%20318%20648%20688> / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Watch Live Cluster Changes Problem

2018-03-01 Thread David Turner
`ceph pg stat` might be cleaner to watch than the `ceph status | grep
pgs`.  I also like watching `ceph osd pool stats` which breaks down all IO
by pool.  You also have the option of the dashboard mgr service which has a
lot of useful information including the pool IO breakdown.
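
For example (the interval is arbitrary, and if I remember correctly the
dashboard listens on port 7000 of the active mgr by default):

   ceph pg stat
   watch -n5 'ceph osd pool stats'
   ceph mgr module enable dashboard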

On Thu, Mar 1, 2018 at 7:22 AM Georgios Dimitrakakis 
wrote:

>
>  Excellent! Good to know that the behavior is intentional!
>
>  Thanks a lot John for the feedback!
>
>  Best regards,
>
>  G.
>
> > On Thu, Mar 1, 2018 at 12:03 PM, Georgios Dimitrakakis
> >  wrote:
> >> I have recently updated to Luminous (12.2.4) and I have noticed that
> >> using
> >> "ceph -w" only produces an initial output like the one below but
> >> never gets
> >> updated afterwards. Is this a feature because I was used to the old
> >> way that
> >> was constantly
> >> producing info.
> >
> > It's intentional.  "ceph -w" is the command that follows the Ceph
> > cluster log.  The monitor used to dump the pg status into the cluster
> > log every 5 seconds, which was useful sometimes, but also made the
> > log
> > pretty unreadable for anything else, because other output was quickly
> > swamped with the pg status spam.
> >
> > To replicate the de-facto old behaviour (print the pg status every 5
> > seconds), you can always do something like `watch -n1 "ceph status |
> > grep pgs"`
> >
> > There's work ongoing to create a nice replacement that does a status
> > stream without spamming the cluster log to accomplish it here:
> > https://github.com/ceph/ceph/pull/20100
> >
> > Cheers,
> > John
> >
> >>
> >> Here is what I get as initial output which is not updated:
> >>
> >> $ ceph -w
> >>   cluster:
> >> id: d357a551-5b7a-4501-8d8f-009c63b2c972
> >> health: HEALTH_OK
> >>
> >>   services:
> >> mon: 1 daemons, quorum node1
> >> mgr: node1(active)
> >> osd: 2 osds: 2 up, 2 in
> >> rgw: 1 daemon active
> >>
> >>   data:
> >> pools:   11 pools, 152 pgs
> >> objects: 9786 objects, 33754 MB
> >> usage:   67494 MB used, 3648 GB / 3714 GB avail
> >> pgs: 152 active+clean
> >>
> >>
> >>
> >> Even if I create a new volume in my Openstack installation, assign
> >> it to a
> >> VM, mount it and format it, I have to stop and re-execute the "ceph
> >> -w"
> >> command to see the following line:
> >>
> >>
> >>   io:
> >> client:   767 B/s rd, 511 B/s wr, 0 op/s rd, 0 op/s wr
> >>
> >> which also pauses after the first display.
> >>
> >>
> >> Kind regards,
> >>
> >>
> >> G.
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Jason Dillaman
It's very high on our priority list to get a solution merged in the
upstream kernel. There was a proposal to use DLM to distribute the PGR
state between target gateways (a la the SCST target) and it's quite
possible that would have the least amount of upstream resistance since
it would work for all backends and not just RBD. We, of course, would
love to just use the Ceph cluster to distribute the state information
instead of requiring a bolt-on DLM (with its STONITH error handling),
but we'll take what we can get (merged).

I believe SUSE uses a custom downstream kernel that stores the PGR
state in the Ceph cluster but requires two round-trips to the cluster
for each IO (first to verify the PGR state and the second to perform
the IO). The PetaSAN project is built on top of these custom kernel
patches as well, I believe.

On Thu, Mar 1, 2018 at 8:50 AM, Samuel Soulard  wrote:
> On another note, is there any work being done for persistent group
> reservations support for Ceph/LIO compatibility? Or just a rough estimate :)
>
> Would love to see Redhat/Ceph support this type of setup.  I know Suse
> supports it as of late.
>
> Sam
>
> On Mar 1, 2018 07:33, "Kai Wagner"  wrote:
>>
>> I totally understand and see your frustration here, but you've to keep
>> in mind that this is an Open Source project with a lots of volunteers.
>> If you have a really urgent need, you have the possibility to develop
>> such a feature on your own or you've to buy someone who could do the
>> work for you.
>>
>> It's a long journey but it seems like it finally comes to an end.
>>
>>
>> On 03/01/2018 01:26 PM, Max Cuttins wrote:
>> > It's obvious that Citrix in not anymore belivable.
>> > However, at least Ceph should have added iSCSI to it's platform during
>> > all these years.
>> > Ceph is awesome, so why just don't kill all the competitors make it
>> > compatible even with washingmachine?
>>
>> --
>> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB
>> 21284 (AG Nürnberg)
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow clients after git pull

2018-03-01 Thread Daniel Carrasco
Hello,

I've tried to change a lot of things in the configuration and to use
ceph-fuse, but nothing makes it work better... When I deploy the git
repository it becomes much slower until I remount the FS (just executing
systemctl stop nginx && umount /mnt/ceph && mount -a && systemctl start
nginx). It happens when the FS gets a lot of IO, because when I execute
rsync I get the same problem.

I'm thinking about downgrading to an older version of ceph, for example
jewel, to see if it works better. I know it will be deprecated soon, but I
don't know what other tests I can do...

Greetings!!

2018-02-28 17:11 GMT+01:00 Daniel Carrasco :

> Hello,
>
> I've created a Ceph cluster with 3 nodes and a FS to serve a webpage. The
> webpage speed is good enough (near to NFS speed), and have HA if one FS die.
> My problem comes when I deploy a git repository on that FS. The server
> makes a lot of IOPS to check the files that have to update and then all
> clients starts to have problems to use the FS (it becomes much slower).
> In a normal usage the web takes about 400ms to load, and when the problem
> start it takes more than 3s. To fix the problem I just have to remount the
> FS on clients, but I can't remount the FS on every deploy...
>
> While is deploying I see how the CPU on MDS is a bit higher, but when it
> ends the CPU usage goes down again, so look like is not a problem of CPU.
>
> My config file is:
> [global]
> fsid = bf56854..e611c08
> mon_initial_members = fs-01, fs-02, fs-03
> mon_host = 10.50.0.94,10.50.1.216,10.50.2.52
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> public network = 10.50.0.0/22
> osd pool default size = 3
>
> ##
> ### OSD
> ##
> [osd]
>   osd_mon_heartbeat_interval = 5
>   osd_mon_report_interval_max = 10
>   osd_heartbeat_grace = 15
>   osd_fast_fail_on_connection_refused = True
>   osd_pool_default_pg_num = 128
>   osd_pool_default_pgp_num = 128
>   osd_pool_default_size = 2
>   osd_pool_default_min_size = 2
>
> ##
> ### Monitores
> ##
> [mon]
>   mon_osd_min_down_reporters = 1
>
> ##
> ### MDS
> ##
> [mds]
>   mds_cache_memory_limit = 792723456
>   mds_bal_mode = 1
>
> ##
> ### Client
> ##
> [client]
>   client_cache_size = 32768
>   client_mount_timeout = 30
>   client_oc_max_objects = 2000
>   client_oc_size = 629145600
>   client_permissions = false
>   rbd_cache = true
>   rbd_cache_size = 671088640
>
> My cluster and clients uses Debian 9 with latest ceph version (12.2.4).
> The clients uses kernel modules to mount the share, because are a bit
> faster than fuse modules. The deploy is done on one of the Ceph nodes, that
> have the FS mounted by kernel module too.
> My cluster is not a high usage cluster, so have all daemons on one machine
> (3 machines with OSD, MON, MGR and MDS). All OSD has a copy of the data,
> only one MGR is active and two of the MDS are active with one on standby.
> The clients mount the FS using the three MDS IP addresses and just now
> don't have any request because is not published.
>
> Someone knows what can be happening?, because all works fine (even on
> other cluster I did with an high load), but just deploy the git repository
> and all start to work very slow.
>
> Thanks!!
>
>
> --
> _
>
>   Daniel Carrasco Marín
>   Ingeniería para la Innovación i2TIC, S.L.
>   Tlf:  +34 911 12 32 84 Ext: 223
>   www.i2tic.com
> _
>



-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and multiple RDMA NICs

2018-03-01 Thread David Turner
There has been some chatter on the ML questioning the need to separate out
the public and private subnets for Ceph. The trend seems to be in
simplifying your configuration which for some is not specifying multiple
subnets here.  I haven't heard of anyone complaining about network problems
with putting private and public on the same subnets, but I have seen a lot
of people with networking problems by splitting them up.

Personally I use vlans for the 2 on the same interface at home and I have 4
port 10Gb nics at the office, so we split that up as well, but even there
we might be better suited with bonding all 4 together and using a vlan to
split traffic.  I wouldn't merge them together since we have graphing on
our storage nodes for public and private networks.
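
For what it's worth, the split itself is only two lines of ceph.conf on the
daemon hosts (the subnets below are only examples):

   [global]
   public network  = 10.0.1.0/24
   cluster network = 10.0.2.0/24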

But the take-away is that if it's too hard to split your public and private
subnets... don't.  I doubt you would notice any difference if you were to
get it working vs just not doing it.

On Thu, Mar 1, 2018 at 3:24 AM Justinas LINGYS 
wrote:

> Hi all,
>
> I am running a small Ceph cluster  (1 MON and 3OSDs), and it works fine.
> However, I have a doubt about the two networks (public and cluster) that
> an OSD uses.
> There is a reference from Mellanox (
> https://community.mellanox.com/docs/DOC-2721) how to configure
> 'ceph.conf'. However, after reading the source code (luminous-stable), I
> get a feeling that we cannot run Ceph with two NICs/Ports as we only have
> one 'ms_async_rdma_local_gid' per OSD, and it seems that the source code
> only uses one option (NIC). I would like to ask how I could communicate
> with the public network via one RDMA NIC and communicate  with the cluster
> network via another RDMA NIC (apply RoCEV2 to both NICs). Since gids are
> unique within a machine, how can I use two different gids in 'ceph.conf'?
>
> Justin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Samuel Soulard
On another note, is there any work being done for persistent group
reservations support for Ceph/LIO compatibility? Or just a rough estimate :)

Would love to see Redhat/Ceph support this type of setup.  I know Suse
supports it as of late.

Sam

On Mar 1, 2018 07:33, "Kai Wagner"  wrote:

> I totally understand and see your frustration here, but you've to keep
> in mind that this is an Open Source project with a lots of volunteers.
> If you have a really urgent need, you have the possibility to develop
> such a feature on your own or you've to buy someone who could do the
> work for you.
>
> It's a long journey but it seems like it finally comes to an end.
>
>
> On 03/01/2018 01:26 PM, Max Cuttins wrote:
> > It's obvious that Citrix in not anymore belivable.
> > However, at least Ceph should have added iSCSI to it's platform during
> > all these years.
> > Ceph is awesome, so why just don't kill all the competitors make it
> > compatible even with washingmachine?
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB
> 21284 (AG Nürnberg)
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy won't install luminous (but Jewel instead)

2018-03-01 Thread David Turner
You mean documentation like `ceph-deploy --help` or `man ceph-deploy` or
the [1] online documentation? Spoiler, they all document and explain what
`--release` does. I do agree that the [2] documentation talking about
deploying a luminous cluster should mention it if jewel was left the
default installation on purpose.  I'm guessing that was an oversight as
Luminous has been considered stable since 12.2.0. It will likely be fixed
now that it's brought up but the page talking about deploying would do well
to have a note about being able to choose the release you want for
simplicity.

As a side note, it is possible for any one of us to make change requests to
the documentation as this is an open source project. I have a goal to be
more proactive with taking care of goofs, mistakes, and vague parts of the
documentation as I find them. I'll try to do a write-up for others to
easily get involved in this as well.

[1] http://docs.ceph.com/docs/luminous/man/8/ceph-deploy/
[2] http://docs.ceph.com/docs/luminous/rados/deployment/ceph-deploy-new/

On Thu, Mar 1, 2018, 7:06 AM Max Cuttins  wrote:

> Ah!
> So you think this is done by design?
>
> However that command is very very very usefull.
> Please add that to documentation.
> Next time it will save me 2/3 hours.
>
>
>
> Il 01/03/2018 06:12, Sébastien VIGNERON ha scritto:
>
> Hi Max,
>
> I had the same issue (under Ubuntu 1*6*.04) but I have read the
> ceph-deploy 2.0.0 source code and saw a "—-release" flag for the install
> subcommand. You can found the flag with the following command: ceph-deploy
> install --help
>
> It looks like the culprit part of ceph-deploy can be found around line 20
> of /usr/lib/python2.7/dist-packages/ceph_deploy/install.py:
>
> …
> 14 def sanitize_args(args):
> 15"""
> 16args may need a bunch of logic to set proper defaults that
> argparse is
> 17not well suited for.
> 18"""
> 19if args.release is None:
> 20args.release = 'jewel'
> 21args.default_release = True
> 22
> 23# XXX This whole dance is because --stable is getting deprecated
> 24if args.stable is not None:
> 25LOG.warning('the --stable flag is deprecated, use --release
> instead')
> 26args.release = args.stable
> 27# XXX Tango ends here.
> 28
> 29return args
>
> …
>
> Which means we now have to specify "—-release luminous" when we want to
> install a luminous cluster, at least until luminous is considered stable
> and the ceph-deploy tool is changed.
> I think it may be a Kernel version consideration: not all distro have the
> needed minimum version of the kernel (and features) for a full use of
> luminous.
>
> Cordialement / Best regards,
>
> Sébastien VIGNERON
> CRIANN,
> Ingénieur / Engineer
> Technopôle du Madrillet
> 745, avenue de l'Université
> 76800 Saint-Etienne du Rouvray - France
> tél. +33 2 32 91 42 91 <+33%202%2032%2091%2042%2091>
> fax. +33 2 32 91 42 92 <+33%202%2032%2091%2042%2092>
> http://www.criann.fr
> mailto:sebastien.vigne...@criann.fr 
> support: supp...@criann.fr
>
> Le 1 mars 2018 à 00:37, Max Cuttins  a écrit :
>
> Didn't check at time.
>
> I deployed everything from VM standalone.
> The VM was just build up with fresh new centOS7.4 using minimal
> installation ISO1708.
> It's a completly new/fresh/empty system.
> Then I run:
>
> yum update -y
> yum install wget zip unzip vim pciutils -y
> yum install epel-release -y
> yum update -y
> yum install ceph-deploy -y
> yum install yum-plugin-priorities -y
>
> it installed:
>
> Feb 27 19:24:47 Installed: ceph-deploy-1.5.37-0.noarch
>
> -> install ceph with ceph-deploy on 3 nodes.
>
> As a result I get Jewel.
>
> Then... I purge everything from all the 3 nodes
> yum update again on ceph deployer node and get:
>
> Feb 27 20:33:20 Updated: ceph-deploy-2.0.0-0.noarch
>
> ... then I tried to reinstall over and over but I always get Jewel.
> I tryed to install after removed .ceph file config in my homedir.
> I tryed to install after change default repo to repo-luminous
> ... got always Jewel.
>
> Only force the release in the ceph-deploy command allow me to install
> luminous.
>
> Probably yum-plugin-priorities should not be installed after ceph-deploy
> even if I didn't run still any command.
> But what is so strange is that purge and reinstall everything will always
> reinstall Jewel.
> It seems that some lock file has been write somewhere to use Jewel.
>
>
>
> Il 28/02/2018 22:08, David Turner ha scritto:
>
> Which version of ceph-deploy are you using?
>
> On Wed, Feb 28, 2018 at 4:37 AM Massimiliano Cuttini 
> wrote:
>
>> This worked.
>>
>> However somebody should investigate why default is still jewel on Centos
>> 7.4
>>
>> Il 28/02/2018 00:53, jorpilo ha scritto:
>>
>> Try using:
>> ceph-deploy --release luminous host1...
>>
>>  Mensaje original 
>> De: Massimiliano Cuttini  
>> Fecha: 28/2/18 12:42 a. m. (GMT+01:00)
>> Para: ceph-users@lists.ceph.com
>> Asunto: 

Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-01 Thread Stefan Kooman
Quoting Caspar Smit (caspars...@supernas.eu):
> Stefan,
> 
> How many OSD's and how much RAM are in each server?

Currently 7 OSDs, 128 GB RAM. Max will be 10 OSDs in these servers. 12
cores (at least one core per OSD).

> bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM right?

Apparently. Sure they will use more RAM than just cache to function
correctly. I figured 3 GB per OSD would be enough ...

> Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of total
> RAM. The cache is a part of the memory usage by bluestore OSD's.

A factor 4 is quite high, isn't it? Where is all this RAM used for
besides cache? RocksDB?

So how should I size the amount of RAM in an OSD server for 10 bluestore SSDs
in a replicated setup?

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete a pool

2018-03-01 Thread Eugen Block
It's not necessary to restart a mon if you just want to delete a pool,  
even if the "not observed" message appears. And I would not recommend  
permanently enabling the "easy" way of deleting a pool. If you are
not able to delete the pool after "ceph tell mon ..." try this:


ceph daemon mon. config set mon_allow_pool_delete true

and then retry deleting the pool. This works for me without restarting  
any services or changing config files.
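
For example, something like this on the host running the monitor (the mon id
is taken from your output below, and I'm assuming the pool is actually named
"bench"):

   ceph daemon mon.ceph-node1 config set mon_allow_pool_delete true
   ceph osd pool delete bench bench --yes-i-really-really-mean-it
   ceph daemon mon.ceph-node1 config set mon_allow_pool_delete false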


Regards


Zitat von Ronny Aasen :


On 01. mars 2018 13:04, Max Cuttins wrote:

I was testing IO and I created a bench pool.

But if I tried to delete I get:

   Error EPERM: pool deletion is disabled; you must first set the
   mon_allow_pool_delete config option to true before you can destroy a
   pool

So I run:

   ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
   mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)
   mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)
   mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)

I restarted all the nodes.
But the flag has not been observed.

Is this the right way to remove a pool?


i think you need to set the option in the ceph.conf of the monitors.
and then restart the mon's one by one.

afaik that is by design.
https://blog.widodh.nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/

kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete a pool

2018-03-01 Thread Ronny Aasen

On 01. mars 2018 13:04, Max Cuttins wrote:

I was testing IO and I created a bench pool.

But if I tried to delete I get:

Error EPERM: pool deletion is disabled; you must first set the
mon_allow_pool_delete config option to true before you can destroy a
pool

So I run:

ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not
observed, change may require restart)
mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not
observed, change may require restart)
mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not
observed, change may require restart)

I restarted all the nodes.
But the flag has not been observed.

Is this the right way to remove a pool?


I think you need to set the option in the ceph.conf of the monitors
and then restart the mons one by one.
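
For example (the mon id is taken from the output above, and the unit name
assumes the default systemd naming):

   # /etc/ceph/ceph.conf on the monitor hosts
   [mon]
   mon_allow_pool_delete = true

   systemctl restart ceph-mon@ceph-node1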

afaik that is by design.
https://blog.widodh.nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/

kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete a pool

2018-03-01 Thread Chengguang Xu
> 
> 在 2018年3月1日,下午8:04,Max Cuttins  写道:
> 
> I was testing IO and I created a bench pool.
> 
> But if I tried to delete I get:
> Error EPERM: pool deletion is disabled; you must first set the 
> mon_allow_pool_delete config option to true before you can destroy a pool
> 
> So I run:
> 
> ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
> mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not observed, 
> change may require restart)
> mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not observed, 
> change may require restart)
> mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not observed, 
> change may require restart)
> 
> I restarted all the nodes.
> But the flag has not been observed.
> 
> Is this the right way to remove a pool?


IIRC, “not observed” means the option value cannot be changed dynamically at
runtime.
So you should set the option in the config file and restart the monitor nodes.


Thanks,
Chengguang.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Kai Wagner
I totally understand and see your frustration here, but you have to keep
in mind that this is an Open Source project with a lot of volunteers.
If you have a really urgent need, you have the possibility to develop
such a feature on your own, or you have to pay someone who could do the
work for you.

It's a long journey, but it seems like it is finally coming to an end.


On 03/01/2018 01:26 PM, Max Cuttins wrote:
> It's obvious that Citrix in not anymore belivable.
> However, at least Ceph should have added iSCSI to it's platform during
> all these years.
> Ceph is awesome, so why just don't kill all the competitors make it
> compatible even with washingmachine?

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Max Cuttins

Il 28/02/2018 18:16, David Turner ha scritto:
My thought is that in 4 years you could have migrated to a hypervisor 
that will have better performance into ceph than an added iSCSI layer. 
I won't deploy VMs for ceph on anything that won't allow librbd to 
work. Anything else is added complexity and reduced performance.




You are definitely right: I have to change hypervisor. So why didn't I do
this before?
Because both Citrix/Xen and Inktank/Ceph claimed that they were ready to
add support for Xen in _*2013*_!


It was 2013:
Xen claimed to support Ceph:
https://www.citrix.com/blogs/2013/07/08/xenserver-tech-preview-incorporating-ceph-object-stores-is-now-available/
Inktank said the support for Xen was almost ready:
https://ceph.com/geen-categorie/xenserver-support-for-rbd/


And also iSCSI was close (it was 2014):
https://ceph.com/geen-categorie/updates-to-ceph-tgt-iscsi-support/

So why change hypervisor if everybody tells you that compatibility is
almost ready to be deployed?
... but then 4 years "just" passed, and Xen and Ceph never became
compatible...


It's obvious that Citrix is no longer believable.
However, at least Ceph should have added iSCSI to its platform during
all these years.
Ceph is awesome, so why not just kill all the competitors and make it
compatible with everything, even a washing machine?





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Watch Live Cluster Changes Problem

2018-03-01 Thread Georgios Dimitrakakis


Excellent! Good to know that the behavior is intentional!

Thanks a lot John for the feedback!

Best regards,

G.


On Thu, Mar 1, 2018 at 12:03 PM, Georgios Dimitrakakis
 wrote:
I have recently updated to Luminous (12.2.4) and I have noticed that 
using
"ceph -w" only produces an initial output like the one below but 
never gets
updated afterwards. Is this a feature because I was used to the old 
way that

was constantly
producing info.


It's intentional.  "ceph -w" is the command that follows the Ceph
cluster log.  The monitor used to dump the pg status into the cluster
log every 5 seconds, which was useful sometimes, but also made the 
log

pretty unreadable for anything else, because other output was quickly
swamped with the pg status spam.

To replicate the de-facto old behaviour (print the pg status every 5
seconds), you can always do something like `watch -n1 "ceph status |
grep pgs"`

There's work ongoing to create a nice replacement that does a status
stream without spamming the cluster log to accomplish it here:
https://github.com/ceph/ceph/pull/20100

Cheers,
John



Here is what I get as initial output which is not updated:

$ ceph -w
  cluster:
id: d357a551-5b7a-4501-8d8f-009c63b2c972
health: HEALTH_OK

  services:
mon: 1 daemons, quorum node1
mgr: node1(active)
osd: 2 osds: 2 up, 2 in
rgw: 1 daemon active

  data:
pools:   11 pools, 152 pgs
objects: 9786 objects, 33754 MB
usage:   67494 MB used, 3648 GB / 3714 GB avail
pgs: 152 active+clean



Even if I create a new volume in my Openstack installation, assign 
it to a
VM, mount it and format it, I have to stop and re-execute the "ceph 
-w"

command to see the following line:


  io:
client:   767 B/s rd, 511 B/s wr, 0 op/s rd, 0 op/s wr

which also pauses after the first display.


Kind regards,


G.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-announce] Luminous v12.2.4 released

2018-03-01 Thread Abhishek Lekshmanan
Jaroslaw Owsiewski  writes:

> What about this: https://tracker.ceph.com/issues/22015#change-105987 ?

That still has to wait for 12.2.5, unfortunately. We only included some
critical build/ceph-disk fixes and whatever PRs had already passed QE
post-12.2.3 in 12.2.4.
>
> Regards
>
> -- 
> Jarek
>
> 2018-02-28 16:46 GMT+01:00 Abhishek Lekshmanan :
>
>>
>> This is the fourth bugfix release of Luminous v12.2.x long term stable
>> release series. This was primarily intended to fix a few build,
>> ceph-volume/ceph-disk issues from 12.2.3 and a few RGW issues. We
>> recommend all the users of 12.2.x series to update. A full changelog is
>> also published at the official release blog at
>> https://ceph.com/releases/v12-2-4-luminous-released/
>>
>> Notable Changes
>> ---
>> * cmake: check bootstrap.sh instead before downloading boost (issue#23071,
>> pr#20515, Kefu Chai)
>> * core: Backport of cache manipulation: issues #22603 and #22604
>> (issue#22604, issue#22603, pr#20353, Adam C. Emerson)
>> * core: Snapset inconsistency is detected with its own error (issue#22996,
>> pr#20501, David Zafman)
>> * tools: ceph-objectstore-tool: "$OBJ get-omaphdr" and "$OBJ list-omap"
>> scan all pgs instead of using specific pg (issue#21327, pr#20283, David
>> Zafman)
>> * ceph-volume: warn on mix of filestore and bluestore flags (issue#23003,
>> pr#20568, Alfredo Deza)
>> * ceph-volume: adds support to zap encrypted devices (issue#22878,
>> pr#20545, Andrew Schoen)
>> * ceph-volume: log the current running command for easier debugging
>> (issue#23004, pr#20597, Andrew Schoen)
>> * core: last-stat-seq returns 0 because osd stats are cleared
>> (issue#23093, pr#20548, Sage Weil, David Zafman)
>> * rgw:  make init env methods return an error (issue#23039, pr#20564,
>> Abhishek Lekshmanan)
>> * rgw: URL-decode S3 and Swift object-copy URLs (issue#22121, issue#22729,
>> pr#20236, Malcolm Lee, Matt Benjamin)
>> * rgw: parse old rgw_obj with namespace correctly (issue#22982, pr#20566,
>> Yehuda Sadeh)
>> * rgw: return valid Location element, CompleteMultipartUpload
>> (issue#22655, pr#20266, Matt Benjamin)
>> * rgw: use explicit index pool placement (issue#22928, pr#20565, Yehuda
>> Sadeh)
>> * tools: ceph-disk: v12.2.2 unable to create bluestore osd using ceph-disk
>> (issue#22354, pr#20563, Kefu Chai)
>>
>> Getting Ceph
>> 
>> * Git at git://github.com/ceph/ceph.git
>> * Tarball at http://download.ceph.com/tarballs/ceph-12.2.4.tar.gz
>> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
>> * For ceph-deploy, see http://docs.ceph.com/docs/
>> master/install/install-ceph-deploy
>> * Release git sha1: 52085d5249a80c5f5121a76d6288429f35e4e77b
>>
>> --
>> Abhishek Lekshmanan
>> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
>> HRB 21284 (AG Nürnberg)
>> ___
>> Ceph-announce mailing list
>> ceph-annou...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-announce-ceph.com
>>

-- 
Abhishek Lekshmanan
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Dan van der Ster
On Thu, Mar 1, 2018 at 1:08 PM, Stefan Priebe - Profihost AG
 wrote:
> nice thanks will try that soon.
>
> Can you tell me how to change the log lever to info for the balancer module?

debug mgr = 4/5
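
For example, a sketch of where that goes (in ceph.conf on the mgr hosts,
then restart the active mgr so the balancer module reloads):

    [mgr]
        debug mgr = 4/5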

-- dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-01 Thread Max Cuttins

Xen by Citrix used to be a very good hypervisor.
However, they kept using a very old kernel up to version 7.1.

The distribution doesn't allow you to add packages via yum, so you need
to hack it.

I helped to develop the installer of the unofficial plugin:
https://github.com/rposudnevskiy/RBDSR

However, I still don't feel safe using that in production,
so I need to fall back to iSCSI.



Il 28/02/2018 20:16, Mark Schouten ha scritto:

Does Xen still not support RBD? Ceph has been around for years now!

Kind regards,

--
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | i...@tuxis.nl



*From:* Massimiliano Cuttini 
*To:* "ceph-users@lists.ceph.com" 
*Sent:* 28-2-2018 13:53
*Subject:* [ceph-users] Ceph iSCSI is a prank?

I was building Ceph in order to use it with iSCSI.
But I just saw from the docs that it needs:

*CentOS 7.5*
(which is not available yet, it's still at 7.4)
https://wiki.centos.org/Download

*Kernel 4.17*
(which is not available yet, it is still at 4.15.7)
https://www.kernel.org/

So I guess there is no official support and this is just a bad prank.

Ceph has been ready to be used with S3 for many years now,
but it needs the kernel of the next century to work with such an old
technology as iSCSI.
So sad.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Stefan Priebe - Profihost AG
Nice, thanks, I will try that soon.

Can you tell me how to change the log level to info for the balancer module?

Am 01.03.2018 um 11:30 schrieb Dan van der Ster:
> On Thu, Mar 1, 2018 at 10:40 AM, Dan van der Ster  wrote:
>> On Thu, Mar 1, 2018 at 10:38 AM, Dan van der Ster  
>> wrote:
>>> On Thu, Mar 1, 2018 at 10:24 AM, Stefan Priebe - Profihost AG
>>>  wrote:

 Am 01.03.2018 um 09:58 schrieb Dan van der Ster:
> On Thu, Mar 1, 2018 at 9:52 AM, Stefan Priebe - Profihost AG
>  wrote:
>> Hi,
>>
>> Am 01.03.2018 um 09:42 schrieb Dan van der Ster:
>>> On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
>>>  wrote:
 Hi,
 Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
> Is the score improving?
>
> ceph balancer eval
>
> It should be decreasing over time as the variances drop toward zero.
>
> You mentioned a crush optimize code at the beginning... how did that
> leave your cluster? The mgr balancer assumes that the crush weight of
> each OSD is equal to its size in TB.
> Do you have any osd reweights? crush-compat will gradually adjust
> those back to 1.0.

 I reweighted them all back to their correct weight.

 Now the mgr balancer module says:
 mgr[balancer] Failed to find further optimization, score 0.010646

 But as you can see it's heavily imbalanced:


 Example:
 49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49

 vs:

 48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49

 45% usage vs. 63%
>>>
>>> Ahh... but look, the num PGs are perfectly balanced, which implies
>>> that you have a relatively large number of empty PGs.
>>>
>>> But regardless, this is annoying and I expect lots of operators to get
>>> this result. (I've also observed that the num PGs is gets balanced
>>> perfectly at the expense of the other score metrics.)
>>>
>>> I was thinking of a patch around here [1] that lets operators add a
>>> score weight on pgs, objects, bytes so we can balance how we like.
>>>
>>> Spandan: you were the last to look at this function. Do you think it
>>> can be improved as I suggested?
>>
>> Yes the PGs are perfectly distributed - but i think most of the people
>> would like to have a dsitribution by bytes and not pgs.
>>
>> Is this possible? I mean in the code there is already a dict for pgs,
>> objects and bytes - but i don't know how to change the logic. Just
>> remove the pgs and objects from the dict?
>
> It's worth a try to remove the pgs and objects from this dict:
> https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L552

 Do i have to change this 3 to 1 cause we have only one item in the dict?
 I'm not sure where the 3 comes from.
 pe.score /= 3 * len(roots)

>>>
>>> I'm pretty sure that 3 is just for our 3 metrics. Indeed you can
>>> change that to 1.
>>>
>>> I'm trying this on our test cluster here too. The last few lines of
>>> output from `ceph balancer eval-verbose` will confirm that the score
>>> is based only on bytes.
>>>
>>> But I'm not sure this is going to work -- indeed the score here went
>>> from ~0.02 to 0.08, but the do_crush_compat doesn't manage to find a
>>> better score.
>>
>> Maybe this:
>>
>> https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L682
>>
>> I'm trying with that = 'bytes'
> 
> That seems to be working. I sent this PR as a start
> https://github.com/ceph/ceph/pull/20665
> 
> I'm not sure we need to mess with the score function, on second thought.
> 
> -- dan
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Watch Live Cluster Changes Problem

2018-03-01 Thread John Spray
On Thu, Mar 1, 2018 at 12:03 PM, Georgios Dimitrakakis
 wrote:
> I have recently updated to Luminous (12.2.4) and I have noticed that using
> "ceph -w" only produces an initial output like the one below but never gets
> updated afterwards. Is this a feature because I was used to the old way that
> was constantly
> producing info.

It's intentional.  "ceph -w" is the command that follows the Ceph
cluster log.  The monitor used to dump the pg status into the cluster
log every 5 seconds, which was useful sometimes, but also made the log
pretty unreadable for anything else, because other output was quickly
swamped with the pg status spam.

To replicate the de-facto old behaviour (print the pg status every 5
seconds), you can always do something like `watch -n1 "ceph status |
grep pgs"`

There's work ongoing to create a nice replacement that does a status
stream without spamming the cluster log to accomplish it here:
https://github.com/ceph/ceph/pull/20100

Cheers,
John

>
> Here is what I get as initial output which is not updated:
>
> $ ceph -w
>   cluster:
> id: d357a551-5b7a-4501-8d8f-009c63b2c972
> health: HEALTH_OK
>
>   services:
> mon: 1 daemons, quorum node1
> mgr: node1(active)
> osd: 2 osds: 2 up, 2 in
> rgw: 1 daemon active
>
>   data:
> pools:   11 pools, 152 pgs
> objects: 9786 objects, 33754 MB
> usage:   67494 MB used, 3648 GB / 3714 GB avail
> pgs: 152 active+clean
>
>
>
> Even if I create a new volume in my Openstack installation, assign it to a
> VM, mount it and format it, I have to stop and re-execute the "ceph -w"
> command to see the following line:
>
>
>   io:
> client:   767 B/s rd, 511 B/s wr, 0 op/s rd, 0 op/s wr
>
> which also pauses after the first display.
>
>
> Kind regards,
>
>
> G.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy won't install luminous (but Jewel instead)

2018-03-01 Thread Max Cuttins

Ah!
So you think this is done by design?

However, that command is very, very useful.
Please add it to the documentation.
Next time it will save me 2-3 hours.



Il 01/03/2018 06:12, Sébastien VIGNERON ha scritto:

Hi Max,

I had the same issue (under Ubuntu 16.04), but I read the
ceph-deploy 2.0.0 source code and saw a "--release" flag for the
install subcommand. You can find the flag with the following
command: ceph-deploy install --help


It looks like the culprit part of ceph-deploy can be found around line 
20 of /usr/lib/python2.7/dist-packages/ceph_deploy/install.py:


…
    14 def sanitize_args(args):
    15     """
    16     args may need a bunch of logic to set proper defaults that argparse is
    17     not well suited for.
    18     """
    19     if args.release is None:
    20         args.release = 'jewel'
    21         args.default_release = True
    22
    23     # XXX This whole dance is because --stable is getting deprecated
    24     if args.stable is not None:
    25         LOG.warning('the --stable flag is deprecated, use --release instead')
    26         args.release = args.stable
    27     # XXX Tango ends here.
    28
    29     return args
…

Which means we now have to specify "--release luminous" when we want
to install a Luminous cluster, at least until Luminous is considered
stable and the ceph-deploy default is changed.
I think it may be a kernel version consideration: not all distros have
the needed minimum kernel version (and features) for a full use
of Luminous.
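
In other words, to end up with a Luminous cluster the install step would be
run along these lines (host names here are placeholders):

    ceph-deploy install --release luminous ceph-node1 ceph-node2 ceph-node3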


Cordialement / Best regards,

Sébastien VIGNERON
CRIANN,
Ingénieur / Engineer
Technopôle du Madrillet
745, avenue de l'Université
76800 Saint-Etienne du Rouvray - France
tél. +33 2 32 91 42 91
fax. +33 2 32 91 42 92
http://www.criann.fr
mailto:sebastien.vigne...@criann.fr
support: supp...@criann.fr

Le 1 mars 2018 à 00:37, Max Cuttins > a écrit :


Didn't check at the time.

I deployed everything from a standalone VM.
The VM was just built with a fresh new CentOS 7.4 using the minimal
installation ISO 1708.

It's a completely new/fresh/empty system.
Then I run:

yum update -y
yum install wget zip unzip vim pciutils -y
yum install epel-release -y
yum update -y
yum install ceph-deploy -y
yum install yum-plugin-priorities -y

it installed:

Feb 27 19:24:47 Installed: ceph-deploy-1.5.37-0.noarch

-> install ceph with ceph-deploy on 3 nodes.

As a result I got Jewel.

Then... I purged everything from all 3 nodes,
ran yum update again on the ceph-deploy node and got:

Feb 27 20:33:20 Updated: ceph-deploy-2.0.0-0.noarch

... then I tried to reinstall over and over, but I always got Jewel.
I tried to install after removing the .ceph config file in my home dir.
I tried to install after changing the default repo to the luminous repo
... and always got Jewel.

Only forcing the release in the ceph-deploy command allowed me to install
Luminous.


Probably yum-plugin-priorities should not be installed after
ceph-deploy, even though I hadn't run any command yet.
But what is so strange is that purging and reinstalling everything will
always reinstall Jewel.

It seems that some lock file has been written somewhere to use Jewel.



Il 28/02/2018 22:08, David Turner ha scritto:

Which version of ceph-deploy are you using?

On Wed, Feb 28, 2018 at 4:37 AM Massimiliano Cuttini 
mailto:m...@phoenixweb.it>> wrote:


This worked.

However, somebody should investigate why the default is still Jewel
on CentOS 7.4.


Il 28/02/2018 00:53, jorpilo ha scritto:

Try using:
ceph-deploy --release luminous host1...

 Original message 
From: Massimiliano Cuttini 

Date: 28/2/18 12:42 a.m. (GMT+01:00)
To: ceph-users@lists.ceph.com 
Subject: [ceph-users] ceph-deploy won't install luminous (but
Jewel instead)

This is the 5th time that I have installed and then purged the
installation.
ceph-deploy always installs JEWEL instead of Luminous.

No way even if I force the repo from default to luminous:

|https://download.ceph.com/rpm-luminous/el7/noarch|

It still installs Jewel; it's stuck.

I've already checked that I had installed yum-plugin-priorities,
and I did.
Everything is exactly as the documentation requests.
But I still always get Jewel and not Luminous.




___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cannot delete a pool

2018-03-01 Thread Max Cuttins

I was testing IO and I created a bench pool.

But if I tried to delete I get:

   Error EPERM: pool deletion is disabled; you must first set the
   mon_allow_pool_delete config option to true before you can destroy a
   pool

So I run:

   ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
   mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)
   mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)
   mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not
   observed, change may require restart)

I restarted all the nodes.
But the flag has not been observed.

Is this the right way to remove a pool?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous Watch Live Cluster Changes Problem

2018-03-01 Thread Georgios Dimitrakakis
I have recently updated to Luminous (12.2.4) and I have noticed that 
using "ceph -w" only produces an initial output like the one below but 
never gets updated afterwards. Is this a feature because I was used to 
the old way that was constantly

producing info.

Here is what I get as initial output which is not updated:

$ ceph -w
  cluster:
id: d357a551-5b7a-4501-8d8f-009c63b2c972
health: HEALTH_OK

  services:
mon: 1 daemons, quorum node1
mgr: node1(active)
osd: 2 osds: 2 up, 2 in
rgw: 1 daemon active

  data:
pools:   11 pools, 152 pgs
objects: 9786 objects, 33754 MB
usage:   67494 MB used, 3648 GB / 3714 GB avail
pgs: 152 active+clean



Even if I create a new volume in my Openstack installation, assign it 
to a VM, mount it and format it, I have to stop and re-execute the "ceph 
-w" command to see the following line:



  io:
client:   767 B/s rd, 511 B/s wr, 0 op/s rd, 0 op/s wr

which also pauses after the first display.


Kind regards,


G.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-01 Thread Caspar Smit
Stefan,

How many OSDs and how much RAM are in each server?

bluestore_cache_size=6G does not mean each OSD uses at most 6 GB of RAM, right?

Our bluestore HDD OSDs with bluestore_cache_size at 1G use ~4 GB of total
RAM. The cache is only part of the memory usage of bluestore OSDs.
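
To compare what the OSDs account for themselves with what the kernel reports,
a quick sketch like the one below can help (assumptions: admin sockets under
/var/run/ceph/, and the dump_mempools JSON shape shown in the quoted output
below, i.e. a top-level "total" object with a "bytes" field; compare the
printed figures with the RSS that top/ps reports per ceph-osd process):

    import glob
    import json
    import subprocess

    # Sum what each local OSD *thinks* it is using according to dump_mempools.
    grand_total = 0
    for sock in sorted(glob.glob('/var/run/ceph/ceph-osd.*.asok')):
        osd_id = sock.rsplit('.', 2)[-2]   # e.g. "12" from .../ceph-osd.12.asok
        out = subprocess.check_output(
            ['ceph', 'daemon', 'osd.' + osd_id, 'dump_mempools'])
        bytes_used = json.loads(out.decode('utf-8'))['total']['bytes']
        grand_total += bytes_used
        print('osd.%s: %.2f GiB tracked in mempools'
              % (osd_id, bytes_used / 2.0 ** 30))
    print('all local OSDs: %.2f GiB tracked' % (grand_total / 2.0 ** 30))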

Kind regards,
Caspar

2018-02-28 16:18 GMT+01:00 Stefan Kooman :

> Hi,
>
> TL;DR: we see "used" memory grows indefinitely on our OSD servers.
> Until the point that either 1) a OSD process gets killed by OOMkiller,
> or 2) OSD aborts (probably because malloc cannot provide more RAM). I
> suspect a memory leak of the OSDs.
>
> We were running 12.2.2. We are now running 12.2.3. Replicated setup,
> SIZE=3, MIN_SIZE=2. All servers were rebooted. The "used" memory is
> slowly, but steadily growing.
>
> ceph.conf:
> bluestore_cache_size=6G
>
> ceph daemon osd.$daemon dump_mempools info gives:
>
> "total": {
> "items": 52925781,
> "bytes": 6058227868
>
> ... for roughly all OSDs. So the OSD process is not "exceeding" what it
> *thinks* it's using.
>
> We haven't noticed this during the "pre-production" phase of the cluster.
> Main
> difference with "pre-production" and "production" is that we are using
> "compression" on the pool.
>
> ceph osd pool set $pool compression_algorithm snappy
> ceph osd pool set $pool compression_mode aggressive
>
> I haven't seen any of you complaining about memory leaks besides the well
> know
> leak in 12.2.1. How many of you are using compression like this? If it has
> anything to do with this at all ...
>
> Currently at ~ 60 GB used with 2 days uptime. 42 GB of RAM usage for all
> OSDs
> ... 18 GB leaked?
>
> If Ceph keeps releasing minor versions so quickly it will never really
> become a
> big problem ;-).
>
> Any hints to analyse this issue?
>
> Gr. Stefan
>
>
>
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Dan van der Ster
On Thu, Mar 1, 2018 at 10:40 AM, Dan van der Ster  wrote:
> On Thu, Mar 1, 2018 at 10:38 AM, Dan van der Ster  wrote:
>> On Thu, Mar 1, 2018 at 10:24 AM, Stefan Priebe - Profihost AG
>>  wrote:
>>>
>>> Am 01.03.2018 um 09:58 schrieb Dan van der Ster:
 On Thu, Mar 1, 2018 at 9:52 AM, Stefan Priebe - Profihost AG
  wrote:
> Hi,
>
> Am 01.03.2018 um 09:42 schrieb Dan van der Ster:
>> On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
>>  wrote:
>>> Hi,
>>> Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
 Is the score improving?

 ceph balancer eval

 It should be decreasing over time as the variances drop toward zero.

 You mentioned a crush optimize code at the beginning... how did that
 leave your cluster? The mgr balancer assumes that the crush weight of
 each OSD is equal to its size in TB.
 Do you have any osd reweights? crush-compat will gradually adjust
 those back to 1.0.
>>>
>>> I reweighted them all back to their correct weight.
>>>
>>> Now the mgr balancer module says:
>>> mgr[balancer] Failed to find further optimization, score 0.010646
>>>
>>> But as you can see it's heavily imbalanced:
>>>
>>>
>>> Example:
>>> 49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49
>>>
>>> vs:
>>>
>>> 48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49
>>>
>>> 45% usage vs. 63%
>>
>> Ahh... but look, the num PGs are perfectly balanced, which implies
>> that you have a relatively large number of empty PGs.
>>
>> But regardless, this is annoying and I expect lots of operators to get
>> this result. (I've also observed that the num PGs is gets balanced
>> perfectly at the expense of the other score metrics.)
>>
>> I was thinking of a patch around here [1] that lets operators add a
>> score weight on pgs, objects, bytes so we can balance how we like.
>>
>> Spandan: you were the last to look at this function. Do you think it
>> can be improved as I suggested?
>
> Yes the PGs are perfectly distributed - but i think most of the people
> would like to have a dsitribution by bytes and not pgs.
>
> Is this possible? I mean in the code there is already a dict for pgs,
> objects and bytes - but i don't know how to change the logic. Just
> remove the pgs and objects from the dict?

 It's worth a try to remove the pgs and objects from this dict:
 https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L552
>>>
>>> Do i have to change this 3 to 1 cause we have only one item in the dict?
>>> I'm not sure where the 3 comes from.
>>> pe.score /= 3 * len(roots)
>>>
>>
>> I'm pretty sure that 3 is just for our 3 metrics. Indeed you can
>> change that to 1.
>>
>> I'm trying this on our test cluster here too. The last few lines of
>> output from `ceph balancer eval-verbose` will confirm that the score
>> is based only on bytes.
>>
>> But I'm not sure this is going to work -- indeed the score here went
>> from ~0.02 to 0.08, but the do_crush_compat doesn't manage to find a
>> better score.
>
> Maybe this:
>
> https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L682
>
> I'm trying with that = 'bytes'

That seems to be working. I sent this PR as a start
https://github.com/ceph/ceph/pull/20665

I'm not sure we need to mess with the score function, on second thought.

-- dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Developer Monthly - March 2018

2018-03-01 Thread Lenz Grimmer
On 02/28/2018 11:51 PM, Sage Weil wrote:

> On Wed, 28 Feb 2018, Dan Mick wrote:
> 
>> Would anyone else appreciate a Google Calendar invitation for the
>> CDMs? Seems like a natural.
> 
> Funny you should mention it!  I was just talking to Leo this morning
> about creating a public Ceph Events calendar that has all of the
> public events (CDM, tech talks, weekly perf call, etc.).
> 
> (Also, we're setting up a Ceph Meetings calendar for meetings that
> aren't completely public that can be shared with active developers
> for standing meetings that are currently invite-only meetings.  e.g.,
> standups, advisory board, etc.)

That'd be excellent - +1

Thanks!

Lenz

-- 
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Dan van der Ster
On Thu, Mar 1, 2018 at 10:38 AM, Dan van der Ster  wrote:
> On Thu, Mar 1, 2018 at 10:24 AM, Stefan Priebe - Profihost AG
>  wrote:
>>
>> Am 01.03.2018 um 09:58 schrieb Dan van der Ster:
>>> On Thu, Mar 1, 2018 at 9:52 AM, Stefan Priebe - Profihost AG
>>>  wrote:
 Hi,

 Am 01.03.2018 um 09:42 schrieb Dan van der Ster:
> On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
>  wrote:
>> Hi,
>> Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
>>> Is the score improving?
>>>
>>> ceph balancer eval
>>>
>>> It should be decreasing over time as the variances drop toward zero.
>>>
>>> You mentioned a crush optimize code at the beginning... how did that
>>> leave your cluster? The mgr balancer assumes that the crush weight of
>>> each OSD is equal to its size in TB.
>>> Do you have any osd reweights? crush-compat will gradually adjust
>>> those back to 1.0.
>>
>> I reweighted them all back to their correct weight.
>>
>> Now the mgr balancer module says:
>> mgr[balancer] Failed to find further optimization, score 0.010646
>>
>> But as you can see it's heavily imbalanced:
>>
>>
>> Example:
>> 49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49
>>
>> vs:
>>
>> 48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49
>>
>> 45% usage vs. 63%
>
> Ahh... but look, the num PGs are perfectly balanced, which implies
> that you have a relatively large number of empty PGs.
>
> But regardless, this is annoying and I expect lots of operators to get
> this result. (I've also observed that the num PGs is gets balanced
> perfectly at the expense of the other score metrics.)
>
> I was thinking of a patch around here [1] that lets operators add a
> score weight on pgs, objects, bytes so we can balance how we like.
>
> Spandan: you were the last to look at this function. Do you think it
> can be improved as I suggested?

 Yes the PGs are perfectly distributed - but i think most of the people
 would like to have a dsitribution by bytes and not pgs.

 Is this possible? I mean in the code there is already a dict for pgs,
 objects and bytes - but i don't know how to change the logic. Just
 remove the pgs and objects from the dict?
>>>
>>> It's worth a try to remove the pgs and objects from this dict:
>>> https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L552
>>
>> Do i have to change this 3 to 1 cause we have only one item in the dict?
>> I'm not sure where the 3 comes from.
>> pe.score /= 3 * len(roots)
>>
>
> I'm pretty sure that 3 is just for our 3 metrics. Indeed you can
> change that to 1.
>
> I'm trying this on our test cluster here too. The last few lines of
> output from `ceph balancer eval-verbose` will confirm that the score
> is based only on bytes.
>
> But I'm not sure this is going to work -- indeed the score here went
> from ~0.02 to 0.08, but the do_crush_compat doesn't manage to find a
> better score.

Maybe this:

https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L682

I'm trying with that = 'bytes'

-- dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Dan van der Ster
On Thu, Mar 1, 2018 at 10:24 AM, Stefan Priebe - Profihost AG
 wrote:
>
> Am 01.03.2018 um 09:58 schrieb Dan van der Ster:
>> On Thu, Mar 1, 2018 at 9:52 AM, Stefan Priebe - Profihost AG
>>  wrote:
>>> Hi,
>>>
>>> Am 01.03.2018 um 09:42 schrieb Dan van der Ster:
 On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
  wrote:
> Hi,
> Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
>> Is the score improving?
>>
>> ceph balancer eval
>>
>> It should be decreasing over time as the variances drop toward zero.
>>
>> You mentioned a crush optimize code at the beginning... how did that
>> leave your cluster? The mgr balancer assumes that the crush weight of
>> each OSD is equal to its size in TB.
>> Do you have any osd reweights? crush-compat will gradually adjust
>> those back to 1.0.
>
> I reweighted them all back to their correct weight.
>
> Now the mgr balancer module says:
> mgr[balancer] Failed to find further optimization, score 0.010646
>
> But as you can see it's heavily imbalanced:
>
>
> Example:
> 49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49
>
> vs:
>
> 48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49
>
> 45% usage vs. 63%

 Ahh... but look, the num PGs are perfectly balanced, which implies
 that you have a relatively large number of empty PGs.

 But regardless, this is annoying and I expect lots of operators to get
 this result. (I've also observed that the num PGs is gets balanced
 perfectly at the expense of the other score metrics.)

 I was thinking of a patch around here [1] that lets operators add a
 score weight on pgs, objects, bytes so we can balance how we like.

 Spandan: you were the last to look at this function. Do you think it
 can be improved as I suggested?
>>>
>>> Yes the PGs are perfectly distributed - but i think most of the people
>>> would like to have a dsitribution by bytes and not pgs.
>>>
>>> Is this possible? I mean in the code there is already a dict for pgs,
>>> objects and bytes - but i don't know how to change the logic. Just
>>> remove the pgs and objects from the dict?
>>
>> It's worth a try to remove the pgs and objects from this dict:
>> https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L552
>
> Do i have to change this 3 to 1 cause we have only one item in the dict?
> I'm not sure where the 3 comes from.
> pe.score /= 3 * len(roots)
>

I'm pretty sure that 3 is just for our 3 metrics. Indeed you can
change that to 1.

I'm trying this on our test cluster here too. The last few lines of
output from `ceph balancer eval-verbose` will confirm that the score
is based only on bytes.

But I'm not sure this is going to work -- indeed the score here went
from ~0.02 to 0.08, but the do_crush_compat doesn't manage to find a
better score.
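
For intuition only, the bytes-only idea boils down to something like the
standalone sketch below (this is not the mgr module's actual code, just an
illustration: it measures how far each OSD's byte usage is from its
weight-proportional share, so a perfectly even distribution scores 0):

    # Illustrative bytes-only balance score: 0 means perfectly even,
    # larger means more imbalance. Not the actual mgr/balancer formula.
    def bytes_score(used_bytes_per_osd, crush_weights):
        total_used = float(sum(used_bytes_per_osd))
        total_weight = float(sum(crush_weights))
        score = 0.0
        for used, weight in zip(used_bytes_per_osd, crush_weights):
            expected = total_used * weight / total_weight
            score += abs(used - expected) / total_used
        return score

    # e.g. the two OSDs quoted above (546G vs 397G used, equal 0.84 weights)
    print(bytes_score([546, 397], [0.84, 0.84]))   # ~0.158: clearly imbalanced
    print(bytes_score([472, 471], [0.84, 0.84]))   # ~0.001: nearly even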

-- Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Stefan Priebe - Profihost AG

Am 01.03.2018 um 09:58 schrieb Dan van der Ster:
> On Thu, Mar 1, 2018 at 9:52 AM, Stefan Priebe - Profihost AG
>  wrote:
>> Hi,
>>
>> Am 01.03.2018 um 09:42 schrieb Dan van der Ster:
>>> On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
>>>  wrote:
 Hi,
 Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
> Is the score improving?
>
> ceph balancer eval
>
> It should be decreasing over time as the variances drop toward zero.
>
> You mentioned a crush optimize code at the beginning... how did that
> leave your cluster? The mgr balancer assumes that the crush weight of
> each OSD is equal to its size in TB.
> Do you have any osd reweights? crush-compat will gradually adjust
> those back to 1.0.

 I reweighted them all back to their correct weight.

 Now the mgr balancer module says:
 mgr[balancer] Failed to find further optimization, score 0.010646

 But as you can see it's heavily imbalanced:


 Example:
 49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49

 vs:

 48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49

 45% usage vs. 63%
>>>
>>> Ahh... but look, the num PGs are perfectly balanced, which implies
>>> that you have a relatively large number of empty PGs.
>>>
>>> But regardless, this is annoying and I expect lots of operators to get
>>> this result. (I've also observed that the num PGs is gets balanced
>>> perfectly at the expense of the other score metrics.)
>>>
>>> I was thinking of a patch around here [1] that lets operators add a
>>> score weight on pgs, objects, bytes so we can balance how we like.
>>>
>>> Spandan: you were the last to look at this function. Do you think it
>>> can be improved as I suggested?
>>
>> Yes the PGs are perfectly distributed - but i think most of the people
>> would like to have a dsitribution by bytes and not pgs.
>>
>> Is this possible? I mean in the code there is already a dict for pgs,
>> objects and bytes - but i don't know how to change the logic. Just
>> remove the pgs and objects from the dict?
> 
> It's worth a try to remove the pgs and objects from this dict:
> https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L552

Do I have to change this 3 to 1, because we have only one item in the dict?
I'm not sure where the 3 comes from.
pe.score /= 3 * len(roots)


> You can update that directly in the python code on your mgr's. Turn
> the ceph balancer off then failover to the next mgr so it reloads the
> module. Then:
> 
> ceph balancer eval
> ceph balancer optimize myplan
> ceph balancer eval myplan
> 
> Does it move in the right direction?
> 
> -- dan
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Dan van der Ster
On Thu, Mar 1, 2018 at 9:52 AM, Stefan Priebe - Profihost AG
 wrote:
> Hi,
>
> Am 01.03.2018 um 09:42 schrieb Dan van der Ster:
>> On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
>>  wrote:
>>> Hi,
>>> Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
 Is the score improving?

 ceph balancer eval

 It should be decreasing over time as the variances drop toward zero.

 You mentioned a crush optimize code at the beginning... how did that
 leave your cluster? The mgr balancer assumes that the crush weight of
 each OSD is equal to its size in TB.
 Do you have any osd reweights? crush-compat will gradually adjust
 those back to 1.0.
>>>
>>> I reweighted them all back to their correct weight.
>>>
>>> Now the mgr balancer module says:
>>> mgr[balancer] Failed to find further optimization, score 0.010646
>>>
>>> But as you can see it's heavily imbalanced:
>>>
>>>
>>> Example:
>>> 49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49
>>>
>>> vs:
>>>
>>> 48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49
>>>
>>> 45% usage vs. 63%
>>
>> Ahh... but look, the num PGs are perfectly balanced, which implies
>> that you have a relatively large number of empty PGs.
>>
>> But regardless, this is annoying and I expect lots of operators to get
>> this result. (I've also observed that the num PGs is gets balanced
>> perfectly at the expense of the other score metrics.)
>>
>> I was thinking of a patch around here [1] that lets operators add a
>> score weight on pgs, objects, bytes so we can balance how we like.
>>
>> Spandan: you were the last to look at this function. Do you think it
>> can be improved as I suggested?
>
> Yes the PGs are perfectly distributed - but i think most of the people
> would like to have a dsitribution by bytes and not pgs.
>
> Is this possible? I mean in the code there is already a dict for pgs,
> objects and bytes - but i don't know how to change the logic. Just
> remove the pgs and objects from the dict?

It's worth a try to remove the pgs and objects from this dict:

https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L552

You can update that directly in the Python code on your mgrs. Turn
the ceph balancer off, then fail over to the next mgr so it reloads the
module. Then:

ceph balancer eval
ceph balancer optimize myplan
ceph balancer eval myplan

Does it move in the right direction?

-- dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Stefan Priebe - Profihost AG
Hi,

Am 01.03.2018 um 09:42 schrieb Dan van der Ster:
> On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
>  wrote:
>> Hi,
>> Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
>>> Is the score improving?
>>>
>>> ceph balancer eval
>>>
>>> It should be decreasing over time as the variances drop toward zero.
>>>
>>> You mentioned a crush optimize code at the beginning... how did that
>>> leave your cluster? The mgr balancer assumes that the crush weight of
>>> each OSD is equal to its size in TB.
>>> Do you have any osd reweights? crush-compat will gradually adjust
>>> those back to 1.0.
>>
>> I reweighted them all back to their correct weight.
>>
>> Now the mgr balancer module says:
>> mgr[balancer] Failed to find further optimization, score 0.010646
>>
>> But as you can see it's heavily imbalanced:
>>
>>
>> Example:
>> 49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49
>>
>> vs:
>>
>> 48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49
>>
>> 45% usage vs. 63%
> 
> Ahh... but look, the num PGs are perfectly balanced, which implies
> that you have a relatively large number of empty PGs.
> 
> But regardless, this is annoying and I expect lots of operators to get
> this result. (I've also observed that the num PGs is gets balanced
> perfectly at the expense of the other score metrics.)
> 
> I was thinking of a patch around here [1] that lets operators add a
> score weight on pgs, objects, bytes so we can balance how we like.
> 
> Spandan: you were the last to look at this function. Do you think it
> can be improved as I suggested?

Yes, the PGs are perfectly distributed - but I think most people
would like to have a distribution by bytes and not by PGs.

Is this possible? I mean, in the code there is already a dict for pgs,
objects and bytes - but I don't know how to change the logic. Just
remove the pgs and objects from the dict?

> Cheers, Dan
> 
> [1] 
> https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L558
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Dan van der Ster
On Thu, Mar 1, 2018 at 9:31 AM, Stefan Priebe - Profihost AG
 wrote:
> Hi,
> Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
>> Is the score improving?
>>
>> ceph balancer eval
>>
>> It should be decreasing over time as the variances drop toward zero.
>>
>> You mentioned a crush optimize code at the beginning... how did that
>> leave your cluster? The mgr balancer assumes that the crush weight of
>> each OSD is equal to its size in TB.
>> Do you have any osd reweights? crush-compat will gradually adjust
>> those back to 1.0.
>
> I reweighted them all back to their correct weight.
>
> Now the mgr balancer module says:
> mgr[balancer] Failed to find further optimization, score 0.010646
>
> But as you can see it's heavily imbalanced:
>
>
> Example:
> 49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49
>
> vs:
>
> 48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49
>
> 45% usage vs. 63%

Ahh... but look, the num PGs are perfectly balanced, which implies
that you have a relatively large number of empty PGs.

But regardless, this is annoying and I expect lots of operators to get
this result. (I've also observed that the num PGs gets balanced
perfectly at the expense of the other score metrics.)

I was thinking of a patch around here [1] that lets operators add a
score weight on pgs, objects, bytes so we can balance how we like.

Spandan: you were the last to look at this function. Do you think it
can be improved as I suggested?

Cheers, Dan

[1] 
https://github.com/ceph/ceph/blob/luminous/src/pybind/mgr/balancer/module.py#L558
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Stefan Priebe - Profihost AG
Hi,
Am 01.03.2018 um 09:03 schrieb Dan van der Ster:
> Is the score improving?
> 
> ceph balancer eval
> 
> It should be decreasing over time as the variances drop toward zero.
> 
> You mentioned a crush optimize code at the beginning... how did that
> leave your cluster? The mgr balancer assumes that the crush weight of
> each OSD is equal to its size in TB.
> Do you have any osd reweights? crush-compat will gradually adjust
> those back to 1.0.

I reweighted them all back to their correct weight.

Now the mgr balancer module says:
mgr[balancer] Failed to find further optimization, score 0.010646

But as you can see it's heavily imbalanced:


Example:
49   ssd 0.84000  1.0   864G   546G   317G 63.26 1.13  49

vs:

48   ssd 0.84000  1.0   864G   397G   467G 45.96 0.82  49

45% usage vs. 63%

Greets,
Stefan

> 
> Cheers, Dan
> 
> 
> 
> On Thu, Mar 1, 2018 at 8:27 AM, Stefan Priebe - Profihost AG
>  wrote:
>> Does anybody have some more input?
>>
>> I keeped the balancer active for 24h now and it is rebalancing 1-3%
>> every 30 minutes but the distribution is still bad.
>>
>> It seems to balance from left to right and than back from right to left...
>>
>> Greets,
>> Stefan
>>
>> Am 28.02.2018 um 13:47 schrieb Stefan Priebe - Profihost AG:
>>> Hello,
>>>
>>> with jewel we always used the python crush optimizer which gave us a
>>> pretty good distribution fo the used space.
>>>
>>> Since luminous we're using the included ceph mgr balancer but the
>>> distribution is far from perfect and much worse than the old method.
>>>
>>> Is there any way to tune the mgr balancer?
>>>
>>> Currently after a balance we still have:
>>> 75% to 92% disk usage which is pretty unfair
>>>
>>> Greets,
>>> Stefan
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and multiple RDMA NICs

2018-03-01 Thread Justinas LINGYS
Hi all,

I am running a small Ceph cluster  (1 MON and 3OSDs), and it works fine.
However, I have a doubt about the two networks (public and cluster) that an OSD 
uses.
There is a reference from Mellanox 
(https://community.mellanox.com/docs/DOC-2721) how to configure 'ceph.conf'. 
However, after reading the source code (luminous-stable), I get a feeling that 
we cannot run Ceph with two NICs/Ports as we only have one 
'ms_async_rdma_local_gid' per OSD, and it seems that the source code only uses 
one option (NIC). I would like to ask how I could communicate with the public 
network via one RDMA NIC and communicate  with the cluster network via another 
RDMA NIC (apply RoCEV2 to both NICs). Since gids are unique within a machine, 
how can I use two different gids in 'ceph.conf'? 
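
For reference, the single-port setup from that Mellanox document boils down
to something like the sketch below; the device name and gid here are only
placeholders, and whether two gids/ports can be used at once is exactly the
open question above:

    [global]
        ms_type = async+rdma
        ms_async_rdma_device_name = mlx5_0
        ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:0a00:0001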

Justin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-01 Thread Dan van der Ster
Is the score improving?

ceph balancer eval

It should be decreasing over time as the variances drop toward zero.

You mentioned a crush optimize code at the beginning... how did that
leave your cluster? The mgr balancer assumes that the crush weight of
each OSD is equal to its size in TB.
Do you have any osd reweights? crush-compat will gradually adjust
those back to 1.0.

Cheers, Dan



On Thu, Mar 1, 2018 at 8:27 AM, Stefan Priebe - Profihost AG
 wrote:
> Does anybody have some more input?
>
> I keeped the balancer active for 24h now and it is rebalancing 1-3%
> every 30 minutes but the distribution is still bad.
>
> It seems to balance from left to right and than back from right to left...
>
> Greets,
> Stefan
>
> Am 28.02.2018 um 13:47 schrieb Stefan Priebe - Profihost AG:
>> Hello,
>>
>> with jewel we always used the python crush optimizer which gave us a
>> pretty good distribution fo the used space.
>>
>> Since luminous we're using the included ceph mgr balancer but the
>> distribution is far from perfect and much worse than the old method.
>>
>> Is there any way to tune the mgr balancer?
>>
>> Currently after a balance we still have:
>> 75% to 92% disk usage which is pretty unfair
>>
>> Greets,
>> Stefan
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com