[ceph-users] qemu/rbd: threads vs native, performance tuning

2018-09-27 Thread Elias Abacioglu
Hi,

I was reading this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008486.html

And I am trying to get better performance in my virtual machines.
These are my RBD settings:
"rbd_cache": "true",
"rbd_cache_block_writes_upfront": "false",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_max_dirty_age": "1.00",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_size": "33554432",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_writethrough_until_flush": "true",

I decided to test native mode and ran fio like this inside a VM:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test
--filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G
--readwrite=randrw --rwmixread=75

I tested these two setups in qemu.



I ran fio a couple of times to get some variance; here are the results:

 READ: io=3071.7MB, aggrb=96718KB/s, minb=96718KB/s, maxb=96718KB/s,
mint=32521msec, maxt=32521msec
WRITE: io=1024.4MB, aggrb=32253KB/s, minb=32253KB/s, maxb=32253KB/s,
mint=32521msec, maxt=32521msec
 READ: io=3071.7MB, aggrb=96451KB/s, minb=96451KB/s, maxb=96451KB/s,
mint=32611msec, maxt=32611msec
WRITE: io=1024.4MB, aggrb=32164KB/s, minb=32164KB/s, maxb=32164KB/s,
mint=32611msec, maxt=32611msec
 READ: io=3071.7MB, aggrb=93763KB/s, minb=93763KB/s, maxb=93763KB/s,
mint=33546msec, maxt=33546msec
WRITE: io=1024.4MB, aggrb=31267KB/s, minb=31267KB/s, maxb=31267KB/s,
mint=33546msec, maxt=33546msec
---

DISK = [ driver = "raw" , cache = "directsync" , discard = "unmap" , io
= "native" ]
 READ: io=3071.7MB, aggrb=68771KB/s, minb=68771KB/s, maxb=68771KB/s,
mint=45737msec, maxt=45737msec
WRITE: io=1024.4MB, aggrb=22933KB/s, minb=22933KB/s, maxb=22933KB/s,
mint=45737msec, maxt=45737msec
 READ: io=3071.7MB, aggrb=67794KB/s, minb=67794KB/s, maxb=67794KB/s,
mint=46396msec, maxt=46396msec
WRITE: io=1024.4MB, aggrb=22607KB/s, minb=22607KB/s, maxb=22607KB/s,
mint=46396msec, maxt=46396msec
 READ: io=3071.7MB, aggrb=67536KB/s, minb=67536KB/s, maxb=67536KB/s,
mint=46573msec, maxt=46573msec
WRITE: io=1024.4MB, aggrb=22521KB/s, minb=22521KB/s, maxb=22521KB/s,
mint=46573msec, maxt=46573msec

So native is around 30-40% faster than threads, according to this.
But I have a few questions now.
1. Is it safe to run cache='directsync' with io='native'? The documentation refers to
writeback/threads.
2. How can I get even better performance? These benchmarks are from a pool
with 11 NVMe BlueStore OSDs and 2x10Gb NICs. It feels pretty slow, IMO.
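
For reference, the two setups map roughly onto the following qemu -drive options (a sketch:
the pool/image names are made up, the first line assumes the threads run used cache=writeback,
which question 1 implies but the original message does not show, and your management layer may
generate slightly different command lines):

    # first setup (threads-based, assumed cache=writeback)
    -drive format=raw,file=rbd:rbd/vm-disk:id=libvirt,cache=writeback,aio=threads,if=virtio
    # second setup, matching the DISK line above
    -drive format=raw,file=rbd:rbd/vm-disk:id=libvirt,cache=directsync,aio=native,if=virtio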

Thanks,
Elias


[ceph-users] Cannot write to cephfs if some osd's are not available on the client network

2018-09-27 Thread Marc Roos


I have a test cluster, and on an OSD node I put a VM. The VM uses a 
macvtap on the client network interface of the OSD node, making access 
to the local OSDs impossible.

The VM of course reports that it cannot access the local OSDs. What I 
am seeing is:

- I cannot reboot this VM normally; I need to reset it.
- The VM is reporting a very high load.

I guess this should not be happening, should it? Shouldn't it choose 
another available OSD of the 3x replicated pool and just write the data 
to that one?











Re: [ceph-users] Cannot write to cephfs if some osd's are not available on the client network

2018-09-27 Thread Burkhard Linke

Hi,


On 09/27/2018 11:15 AM, Marc Roos wrote:

> I have a test cluster and on a osd node I put a vm. The vm is using a
> macvtap on the client network interface of the osd node. Making access
> to local osd's impossible.
>
> the vm of course reports that it cannot access the local osd's. What I
> am getting is:
>
> - I cannot reboot this vm normally, need to reset it.
> - vm is reporting very high load.
>
> I guess this should not be happening not? Because it should choose an
> other available osd of the 3x replicated pool and just write the data to
> that one?
All I/O is sent only to the OSD holding the primary replica of the PG. If this 
happens to be the OSD on the same host, I/O will be stuck.


The high I/O load is due to async I/O threads in the kernel trying to 
access the OSD.


Regards,
Burkhard


Re: [ceph-users] Cannot write to cephfs if some osd's are not available on the client network

2018-09-27 Thread John Spray
On Thu, Sep 27, 2018 at 10:16 AM Marc Roos  wrote:
>
>
> I have a test cluster and on a osd node I put a vm. The vm is using a
> macvtap on the client network interface of the osd node. Making access
> to local osd's impossible.
>
> the vm of course reports that it cannot access the local osd's. What I
> am getting is:
>
> - I cannot reboot this vm normally, need to reset it.

When Linux tries to shut down cleanly, part of that is flushing
buffers from any mounted filesystem back to disk.  If you have a
network filesystem mounted, and the network is unavailable, that can
cause the process to block.  You can try forcibly unmounting before
rebooting.
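
For example (the mount point is illustrative):

    umount -f /mnt/cephfs     # force unmount; may still block if there is dirty data
    umount -l /mnt/cephfs     # lazy unmount: detach now, clean up references later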

> - vm is reporting very high load.

The CPU load part is surprising -- in general Ceph clients should wait
quietly when blocked, rather than spinning.

> I guess this should not be happening not? Because it should choose an
> other available osd of the 3x replicated pool and just write the data to
> that one?

No -- writes always go through the primary OSD for the PG being
written to.  If an OSD goes down, then another OSD will become the
primary.  In your case, the primary OSD is not going down, it's just
being cut off from the client by the network, so the writes are
blocking indefinitely.
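
As an illustration (pool and object names are made up), you can see which OSD is
currently the primary for a given object with:

    ceph osd map rbd rbd_data.1f2ac3d1b58ba.0000000000000000
    # -> ... up ([12,3,7], p12) acting ([12,3,7], p12)   -- p12 is the primary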

John

>
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs-data-scan tool

2018-09-27 Thread Sergey Malinin
Hello,
Does anybody have experience with the cephfs-data-scan tool?
My questions: how long would it take to scan extents on a filesystem with 
120M relatively small files? Also, while running the extents scan I noticed that the 
number of objects in the data pool is decreasing over time. Is that normal?
Thanks.
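
For reference, the extents scan can be run with multiple workers in parallel, which is
usually the only way to make it tractable at that object count. A sketch, assuming the
data pool is called cephfs_data and four workers:

    # run one per worker process/host; worker_m is the total number of workers
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data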


Re: [ceph-users] cephfs-data-scan tool

2018-09-27 Thread John Spray
On Thu, Sep 27, 2018 at 11:03 AM Sergey Malinin  wrote:
>
> Hello,
> Does anybody have experience with using cephfs-data-scan tool?
> Questions I have are how long would it take to scan extents on filesystem 
> with 120M relatively small files? While running extents scan I noticed that 
> number of objects in data pool is decreasing over the time. Is that normal?

The scan_extents operation does not do any deletions, so that is
surprising.  Is it possible that you've accidentally left an MDS
running?

John

> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-27 Thread Willem Jan Withagen

On 26/09/2018 12:41, Eugen Block wrote:

Hi,

I'm not sure how the recovery "still works" with the flag "norecover".
Anyway, I think you should unset the flags norecover, nobackfill. Even 
if not all OSDs come back up you should allow the cluster to backfill 
PGs. Not sure, but unsetting norebalance could also be useful, but 
that can be done step by step. First watch if the cluster gets any 
better without it.


The best way to see if recovery is doing its stuff is to look at the 
recovering PGs in

    ceph pg dump

and check that some of the object counters are actually going down.

If they don't, then the PG is not recovering/backfilling.

I haven't found a better way to determine this (yet).
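
Something along these lines works as a quick check (illustrative):

    ceph pg dump pgs_brief | grep -E 'recover|backfill'   # which PGs are busy right now
    ceph -s                                               # re-run and watch the degraded/misplaced counts drop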

--WjW


And can you check the plan "peetaur2" offered from IRC:
https://bpaste.net/show/20581774ff08
Also Be_El strongly offers to unset nodown parameter.


The suggested config settings look reasonable to me. You should also 
try to raise the timeouts for the MONs and increase their db cache as 
suggested earlier today.


after this point, if an osd is down, it's fine...it'll only prevent 
access to that specific data (bad for clients, fine for recovery)


I agree with that, the cluster state has to become stable first, then 
you can take a look into those OSDs that won't get up.


Regards,
Eugen


Quoting by morphin:


Hello Eugen. Thank you for your answer. I was losing hope of getting
an answer here.

I have lost 2/3 MONs many times before, but I never faced a problem
like this on Luminous.
The recovery still works and it has been 30 hours.  The last state of
my cluster is: https://paste.ubuntu.com/p/rDNHCcNG7P/
We are discussing on IRC whether we should unset the nodown and norecover 
flags or not.


I tried unsetting the nodown flag yesterday, and now 15 OSDs no longer start,
all with the same error --> : https://paste.ubuntu.com/p/94xpzxTSnr/
I don't know the reason for this, but I saw some commits for the
dump problem. Is this a bug or something else?

And can you check the plan "peetaur2" offered from IRC:
https://bpaste.net/show/20581774ff08
Also Be_El strongly offers to unset nodown parameter.
What do you think?
Eugen Block wrote on Wed, 26 Sep 2018 at 12:54:


Hi,

could this be related to this other Mimic upgrade thread [1]? Your
failing MONs sound a bit like the problem described there, eventually
the user reported recovery success. You could try the described steps:

  - disable cephx auth with 'auth_cluster_required = none'
  - set the mon_osd_cache_size = 20 (default 10)
  - Setting 'osd_heartbeat_interval = 30'
  - setting 'mon_lease = 75'
  - increase the rocksdb_cache_size and leveldb_cache_size on the mons
to be big enough to cache the entire db

I just copied the mentioned steps, so please read the thread before
applying anything.

Regards,
Eugen

[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030018.html 




Quoting by morphin:

> After I tried too many things with so many helps on IRC. My pool
> health is still in ERROR and I think I can't recover from this.
> https://paste.ubuntu.com/p/HbsFnfkYDT/
> At the end 2 of 3 mons crashed and started at same time and the pool
> is offlined. Recovery takes more than 12hours and it is way too slow.
> Somehow recovery seems to be not working.
>
> If I can reach my data I will re-create the pool easily.
> If I run ceph-object-tool script to regenerate mon store.db can I
> acccess the RBD pool again?
> by morphin wrote on Tue, 25 Sep 2018 at 20:03:
>>
>> Hi,
>>
>> Cluster is still down :(
>>
>> Up to now we have managed to compensate for the OSDs. 118 of 160 OSDs are
>> stable and the cluster is still in the process of settling. Thanks to
>> the guy Be-El in the ceph IRC channel. Be-El helped a lot to make the
>> flapping OSDs stable.
>>
>> What we have learned up to now is that the cause of this is the unexpected
>> death of 2 of our 3 monitor servers. And when they come back, if they do not
>> start one by one (each after joining the cluster), this can happen. The cluster
>> can be unhealthy and it can take countless hours to come back.
>>
>> Right now here is our status:
>> ceph -s : https://paste.ubuntu.com/p/6DbgqnGS7t/
>> health detail: https://paste.ubuntu.com/p/w4gccnqZjR/
>>
>> Since the OSD disks are NL-SAS it can take up to 24 hours for an online
>> cluster. What is more, it has been said that we would be extremely lucky
>> if all the data is rescued.
>>
>> Most unhappily, our strategy is just to sit and wait :(. As soon as the
>> peering and activating count drops to 300-500 PGs we will restart the
>> stopped OSDs one by one, and for each OSD we will wait for the cluster to
>> settle down. The amount of data stored in the OSDs is 33TB. Our main
>> concern is to export our RBD pool data to an external backup space. Then
>> we will start again with a clean one.
>>
>> I hope to validate our analysis with an expert. Any help or advice
>> would be greatly appreciated.
>> by morphin wrote on Tue, 25 Sep 2018 at 15:08:

Re: [ceph-users] cephfs-data-scan tool

2018-09-27 Thread Sergey Malinin
I'm trying the alternate metadata pool approach. I double-checked that the MDS servers 
are down and that both the original and the recovery filesystems are set not joinable.


> On 27.09.2018, at 13:10, John Spray  wrote:
> 
> On Thu, Sep 27, 2018 at 11:03 AM Sergey Malinin  wrote:
>> 
>> Hello,
>> Does anybody have experience with using cephfs-data-scan tool?
>> Questions I have are how long would it take to scan extents on filesystem 
>> with 120M relatively small files? While running extents scan I noticed that 
>> number of objects in data pool is decreasing over the time. Is that normal?
> 
> The scan_extents operation does not do any deletions, so that is
> surprising.  Is it possible that you've accidentially left an MDS
> running?
> 
> John
> 
> John
> 
>> Thanks.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] cephfs-data-scan tool

2018-09-27 Thread Sergey Malinin
Can such behaviour be related to data pool cache tiering?


> On 27.09.2018, at 13:14, Sergey Malinin  wrote:
> 
> I'm trying alternate metadata pool approach. I double checked that MDS 
> servers are down and both original and recovery fs are set not joinable.
> 
> 
>> On 27.09.2018, at 13:10, John Spray  wrote:
>> 
>> On Thu, Sep 27, 2018 at 11:03 AM Sergey Malinin  wrote:
>>> 
>>> Hello,
>>> Does anybody have experience with using cephfs-data-scan tool?
>>> Questions I have are how long would it take to scan extents on filesystem 
>>> with 120M relatively small files? While running extents scan I noticed that 
>>> number of objects in data pool is decreasing over the time. Is that normal?
>> 
>> The scan_extents operation does not do any deletions, so that is
>> surprising.  Is it possible that you've accidentially left an MDS
>> running?
>> 
>> John
>> 
>> John
>> 
>>> Thanks.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



Re: [ceph-users] cephfs-data-scan tool

2018-09-27 Thread John Spray
On Thu, Sep 27, 2018 at 11:34 AM Sergey Malinin  wrote:
>
> Can such behaviour be related to data pool cache tiering?

Yes -- if there's a cache tier in use then deletions in the base pool
can be delayed and then happen later when the cache entries get
expired.

You may find that for a full scan of objects in the system, having a
cache pool actually slows things down quite a lot, due to the overhead
of promoting things in and out of the cache as we scan.

John

>
>
> > On 27.09.2018, at 13:14, Sergey Malinin  wrote:
> >
> > I'm trying alternate metadata pool approach. I double checked that MDS 
> > servers are down and both original and recovery fs are set not joinable.
> >
> >
> >> On 27.09.2018, at 13:10, John Spray  wrote:
> >>
> >> On Thu, Sep 27, 2018 at 11:03 AM Sergey Malinin  wrote:
> >>>
> >>> Hello,
> >>> Does anybody have experience with using cephfs-data-scan tool?
> >>> Questions I have are how long would it take to scan extents on filesystem 
> >>> with 120M relatively small files? While running extents scan I noticed 
> >>> that number of objects in data pool is decreasing over the time. Is that 
> >>> normal?
> >>
> >> The scan_extents operation does not do any deletions, so that is
> >> surprising.  Is it possible that you've accidentially left an MDS
> >> running?
> >>
> >> John
> >>
> >> John
> >>
> >>> Thanks.
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>


Re: [ceph-users] cephfs-data-scan tool

2018-09-27 Thread Sergey Malinin
> 
> On 27.09.2018, at 15:04, John Spray  wrote:
> 
> On Thu, Sep 27, 2018 at 11:34 AM Sergey Malinin  wrote:
>> 
>> Can such behaviour be related to data pool cache tiering?
> 
> Yes -- if there's a cache tier in use then deletions in the base pool
> can be delayed and then happen later when the cache entries get
> expired.
> 
> You may find that for a full scan of objects in the system, having a
> cache pool actually slows things down quite a lot, due to the overhead
> of promoting things in and out of the cache as we scan.

The 'forward' cache mode is still reported as dangerous. Is it safe enough to switch 
to forward mode while doing the recovery?


> 
> John
> 
>> 
>> 
>>> On 27.09.2018, at 13:14, Sergey Malinin  wrote:
>>> 
>>> I'm trying alternate metadata pool approach. I double checked that MDS 
>>> servers are down and both original and recovery fs are set not joinable.
>>> 
>>> 
 On 27.09.2018, at 13:10, John Spray  wrote:
 
 On Thu, Sep 27, 2018 at 11:03 AM Sergey Malinin  wrote:
> 
> Hello,
> Does anybody have experience with using cephfs-data-scan tool?
> Questions I have are how long would it take to scan extents on filesystem 
> with 120M relatively small files? While running extents scan I noticed 
> that number of objects in data pool is decreasing over the time. Is that 
> normal?
 
 The scan_extents operation does not do any deletions, so that is
 surprising.  Is it possible that you've accidentially left an MDS
 running?
 
 John
 
 John
 
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 



[ceph-users] Mimic cluster is offline and not healing

2018-09-27 Thread by morphin
Hello,

I am writing about an incident that started last weekend.
There seems to be something wrong with my e-mail; some of my messages did
not get through, so I decided to start a new thread here and tell the story
from the beginning.
The related e-mail thread can be found at
(http://lists.ceph.com/pipermail/ceph-community-ceph.com/2018-September/000292.html).

We have a cluster with 28 servers and 168 OSDs. The OSDs are BlueStore on
NL-SAS (non-SMR) disks, with WAL+DB on NVMe. My distro is Arch Linux.

Last weekend I upgraded from 12.2.4 to 13.2.1, and the cluster did
not start because the OSDs were stuck in the booting state. Sage helped me
with it (thanks!) by recreating the MONs' store.db from the OSDs via
ceph-objectstore-tool. At first everything was perfect.

However, two days later I had a most unfortunate accident: 7 of my
servers crashed at the same time. When they came up, the cluster was in
HEALTH_ERR state. 2 of those servers were MONs (I have 3 in total).

I have been working for 3 days collecting data and testing, but I could not
make any progress.

First of all I double-checked OS health, network health, and disk
health, and they have no problems. My further findings are these:
I have an RBD pool with 33TB of VM data.
As soon as an OSD starts it generates a lot of I/O on the BlueStore disks (NL-SAS).
This makes the OSD nearly unresponsive; you can't even injectargs.
The cluster does not settle. I left it alone for 24 hours but the OSD up count
dropped to ~50.
The OSDs are logging too many slow requests.
The OSDs are logging lots of heartbeat messages, and eventually they are
marked down.

Latest cluster status: https://paste.ubuntu.com/p/BhCHmVNZsX/
Ceph.conf : https://paste.ubuntu.com/p/FtY9gfpncN/
Sample OSD log: https://paste.ubuntu.com/p/ZsqpcQVRsj/
Mon log: https://paste.ubuntu.com/p/9T8QtMYZWT/
I/O utilization on disks: https://paste.ubuntu.com/p/mrCTKYpBZR/



So I think my problem is really weird. Somehow the pool cannot heal itself.

The OSDs run at 95% disk I/O utilization and peering is way too slow. The
OSD I/O did not end even after 72 hours.

Because of the high I/O, OSDs can't get an answer from other OSDs and
complain to the monitor. The monitor marks them "down", but I see the OSDs
still running.

For example "ceph -s" says 50 OSDs are up, but I see 153 OSD
processes running in the background and trying to reach other OSDs. So it
is very confusing and certainly not progressing.

We're trying every possible strategy. We stopped the OSDs, then
started one OSD at a time on a server: start one, wait for its I/O to
finish, then move to the next OSD on the same server. We figured out
that even when the first OSD's I/O has finished, the second OSD triggers it
again. So when we started the final, sixth OSD, the other five OSDs
went to 95% I/O too. The first OSD's I/O finished in 8 minutes, but the sixth
OSD's I/O finished in 34 minutes!

Then we moved to the next server. As soon as we started that server's OSDs,
the previously finished OSDs started doing I/O again. So we gained
nothing.

Now we are planning to set noup, then start all 168 OSDs, and then
unset noup. Maybe this will prevent the OSDs from doing the I/O over and over
again.
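
A rough sketch of that sequence (host/unit names are illustrative):

    ceph osd set noup
    systemctl start ceph-osd.target     # on each OSD host, one host at a time
    # wait until the startup I/O has settled on all hosts, then:
    ceph osd unset noup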

After 72 hours I believe we may have hit a bug. Any help would be greatly
appreciated.

We're on IRC 24/7. Thanks to: Be:El, peetaur2, degreaser, Ti and IcePic.


Re: [ceph-users] Mimic cluster is offline and not healing

2018-09-27 Thread Stefan Kooman
Quoting by morphin (morphinwith...@gmail.com):
> After 72 hours I believe we may hit a bug. Any help would be greatly
> appreciated.

Is it feasible for you to stop all client IO to the Ceph cluster? At
least until it stabilizes again. "ceph osd pause" would do the trick
(ceph osd unpause would unset it). 

What kind of workload are you running on the cluster? How does your
crush map looks like (ceph osd getcrushmap -o  /tmp/crush_raw; 
crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?

I have seen a (test) Ceph cluster "healing" itself to the point there was
nothing left to recover on. In *that* case the disks were overbooked
(multiple OSDs per physical disk) ... The flags you set (noout, nodown,
nobackfill, norecover, noscrub, etc., etc.) helped to get it to recover
again. I would try to get all OSDs online again (and manually keep them
up / restart them, because you have set nodown).

Does the cluster recover at all?

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Mimic cluster is offline and not healing

2018-09-27 Thread by morphin
I should not have client I/O right now. All of my VMs are down right
now. There is only a single pool.

Here is my crush map: https://paste.ubuntu.com/p/Z9G5hSdqCR/

The cluster does not recover. After starting the OSDs with the specified
flags, the OSD up count drops from 168 to 50 within 24 hours.
Stefan Kooman wrote on Thu, 27 Sep 2018 at 16:10:
>
> Quoting by morphin (morphinwith...@gmail.com):
> > After 72 hours I believe we may hit a bug. Any help would be greatly
> > appreciated.
>
> Is it feasible for you to stop all client IO to the Ceph cluster? At
> least until it stabilizes again. "ceph osd pause" would do the trick
> (ceph osd unpause would unset it).
>
> What kind of workload are you running on the cluster? How does your
> crush map looks like (ceph osd getcrushmap -o  /tmp/crush_raw;
> crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?
>
> I have seen a (test) Ceph cluster "healing" itself to the point there was
> nothing left to recover on. In *that* case the disks were overbooked
> (multiple OSDs per physical disk) ... The flags you set (nooout, nodown,
> nobackfill, norecover, noscrub, etc., etc.) helped to get it to recover
> again. I would try to get all OSDs online again (and manually keep them
> up / restart them, because you have set nodown).
>
> Does the cluster recover at all?
>
> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


[ceph-users] Cephfs new file in ganesha mount Input/output error

2018-09-27 Thread Marc Roos


If I add a file to CephFS on one client, and the filesystem is exported via 
Ganesha and NFS-mounted somewhere else, I can see the file in the directory 
listing on the other NFS client, but trying to read it gives an Input/output 
error. Other, older files in the same directory I can read.

Has anyone else had this?


nfs-ganesha-xfs-2.6.1-0.1.el7.x86_64
nfs-ganesha-2.6.1-0.1.el7.x86_64
nfs-ganesha-mem-2.6.1-0.1.el7.x86_64
nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
nfs-ganesha-rgw-2.6.1-0.1.el7.x86_64
nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64

ceph-12.2.8-0.el7.x86_64
ceph-base-12.2.8-0.el7.x86_64
ceph-common-12.2.8-0.el7.x86_64
ceph-mds-12.2.8-0.el7.x86_64
ceph-mgr-12.2.8-0.el7.x86_64
ceph-mon-12.2.8-0.el7.x86_64
ceph-osd-12.2.8-0.el7.x86_64
ceph-radosgw-12.2.8-0.el7.x86_64
ceph-selinux-12.2.8-0.el7.x86_64
collectd-ceph-5.8.0-2.el7.x86_64
libcephfs2-12.2.8-0.el7.x86_64
nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
python-cephfs-12.2.8-0.el7.x86_64



## These are defaults for exports.  They can be overridden per-export.
EXPORT_DEFAULTS {
        ## Access type for clients.  Default is None, so some access must be
        ## given either here or in the export itself.
Transports = TCP;
Protocols = 4,3;
Squash = root_id_squash;
anonymous_uid = 500;
anonymous_gid = 500;
Access_Type = RW;
}

## Configure settings for the object handle cache
CACHEINODE {
        ## The point at which object cache entries will start being reused.
        Entries_HWMark = 100;
}



[ceph-users] [CEPH]-[RADOS] Deduplication feature status

2018-09-27 Thread Gaël THEROND
Hi folks!

As I’ll soon start work on a really large, distributed Ceph project for
cold data storage, I’m checking the availability and status of a few
features, with deduplication among them.

I found an interesting video about it from Cephalocon APAC 2018 and a
seven-year-old tracker issue
(https://tracker.ceph.com/issues/1576), but they didn’t really answer my
questions.

Suppose that I want to base this project on Mimic: is deduplication
supported at all, and if so at which level and to what extent?

Thanks a lot for your information!


[ceph-users] slow export of cephfs through samba

2018-09-27 Thread Chad W Seys
Hi all,
   I am exporting cephfs using samba.  It is much slower over samba than 
direct. Anyone know how to speed it up?
   Benchmarked using bonnie++ 5 times either directly to cephfs mounted 
by kernel (v4.18.6) module:
bonnie++ -> kcephfs
or through a cifs kernel-module-mounted (protocol version 3.02) Samba 
(v4.8.5) share on the same machine.
bonnie++ -> Samba -> kcephfs

Abbreviated results for 5 runs:
kcephfs:                    min      max
  files created:            555      619      files/sec
  sequential block input:   106.44   108.13   MB/sec
  sequential block output:  102.82   110.61   MB/sec

(There is a gigabit network between the client and the ceph cluster, so 
the block in/out is pleasing.)

samba -> kcephfs:           min      max
  files created:            45                files/sec
  sequential block input:   22.85    29.5     MB/sec
  sequential block output:  27.95    30.01    MB/sec

The block input/output is acceptably fast, but the number of files created per 
second is low.  Does anyone know how to tweak Samba to speed it up?
   Would Samba's vfs_ceph speed up access?  At the moment vfs_ceph in 
Debian depends on libceph1 10.2.5, so it is not too modern.
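
For reference, a vfs_ceph share would look roughly like this (a sketch; the cephx user 
is made up, and it needs a recent enough Samba/libcephfs build):

[smb-ceph]
        vfs objects = ceph
        path = /                        # path inside CephFS, not a local mount
        read only = No
        kernel share modes = No
        ceph:config_file = /etc/ceph/ceph.conf
        ceph:user_id = samba            # assumed cephx user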

Current Samba settings:
[global]
 dns proxy = No
 hostname lookups = Yes
 kerberos method = secrets and keytab
 logging = syslog@1 /var/log/samba/log.%m
 max log size = 10
 panic action = /usr/share/samba/panic-action %d
 realm = PHYSICS.WISC.EDU
 security = USER
 server signing = required
 server string = %h server
 workgroup = PHYSICS
 fruit:nfs_aces = no
 idmap config * : backend = tdb
[smb]
 ea support = Yes
 inherit acls = Yes
 inherit permissions = Yes
 msdfs root = Yes
 path = /srv/smb
 read only = No
 smb encrypt = desired
 vfs objects = catia fruit streams_xattr
 fruit:encoding = native

Thanks!
Chad.


Re: [ceph-users] [CEPH]-[RADOS] Deduplication feature status

2018-09-27 Thread ceph
As of today, there is no such feature in Ceph

Best regards,


On 09/27/2018 04:34 PM, Gaël THEROND wrote:
> Hi folks!
> 
> As I’ll soon start to work on a new really large an distributed CEPH
> project for cold data storage, I’m checking out a few features availability
> and status, with the need for deduplication among them.
> 
> I found out an interesting video about that from Cephalocon APAC 2018 and a
> seven years old bugtrack (
> https://tracker.ceph.com/issues/1576), but that doesn’t really answered my
> questions.
> 
> Suppose that I want to base this project on mimic, is deduplication kinda
> supported, if so at which level and to which extend ?
> 
> Thanks a lot for your information !
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


[ceph-users] CRUSH puzzle: step weighted-take

2018-09-27 Thread Dan van der Ster
Dear Ceph friends,

I have a CRUSH data migration puzzle and wondered if someone could
think of a clever solution.

Consider an osd tree like this:

  -2   4428.02979     room 0513-R-0050
 -72    911.81897         rack RA01
  -4    917.27899         rack RA05
  -6    917.25500         rack RA09
  -9    786.23901         rack RA13
 -14    895.43903         rack RA17
 -65   1161.16003     room 0513-R-0060
 -71    578.76001         ipservice S513-A-IP38
 -70    287.56000             rack BA09
 -80    291.20001             rack BA10
 -76    582.40002         ipservice S513-A-IP63
 -75    291.20001             rack BA11
 -78    291.20001             rack BA12

In the beginning, for reasons that are not important, we created two pools:
  * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
  * poolB chooses room=0513-R-0060, replicates 2x across the
ipservices, then puts a 3rd replica in room 0513-R-0050.

For clarity, here is the crush rule for poolB:
type replicated
min_size 1
max_size 10
step take 0513-R-0060
step chooseleaf firstn 2 type ipservice
step emit
step take 0513-R-0050
step chooseleaf firstn -2 type rack
step emit

Now to the puzzle.
For reasons that are not important, we now want to change the rule for
poolB to put all three 3 replicas in room 0513-R-0060.
And we need to do this in a way which is totally non-disruptive
(latency-wise) to the users of either pools. (These are both *very*
active RBD pools).

I see two obvious ways to proceed:
  (1) simply change the rule for poolB to put a third replica on any
osd in room 0513-R-0060. I'm afraid though that this would involve way
too many concurrent backfills, cluster-wide, even with
osd_max_backfills=1.
  (2) change poolB size to 2, then change the crush rule to that from
(1), then reset poolB size to 3. This would risk data availability
during the time that the pool is size=2, and also risks that every osd
in room 0513-R-0050 would be too busy deleting for some indeterminate
time period (10s of minutes, I expect).

So I would probably exclude those two approaches.

Conceptually what I'd like to be able to do is a gradual migration,
which if I may invent some syntax on the fly...

Instead of
   step take 0513-R-0050
do
   step weighted-take 99 0513-R-0050 1 0513-R-0060

That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
of the time take room 0513-R-0060.
With a mechanism like that, we could gradually adjust those "step
weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.

I have a feeling that something equivalent to that is already possible
with weight-sets or some other clever crush trickery.
Any ideas?

Best Regards,

Dan


Re: [ceph-users] CRUSH puzzle: step weighted-take

2018-09-27 Thread Luis Periquito
I think your objective is to move the data without anyone else
noticing. What I usually do is reduce the priority of the recovery
process as much as possible. Do note this will make the recovery take
a looong time, and will also make recovery from failures slow...
ceph tell osd.* injectargs '--osd_recovery_sleep 0.9'
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd_recovery_max_chunk 524288'

I would also assume you have set osd_scrub_during_recovery to false.



On Thu, Sep 27, 2018 at 4:19 PM Dan van der Ster  wrote:
>
> Dear Ceph friends,
>
> I have a CRUSH data migration puzzle and wondered if someone could
> think of a clever solution.
>
> Consider an osd tree like this:
>
>   -2   4428.02979 room 0513-R-0050
>  -72911.81897 rack RA01
>   -4917.27899 rack RA05
>   -6917.25500 rack RA09
>   -9786.23901 rack RA13
>  -14895.43903 rack RA17
>  -65   1161.16003 room 0513-R-0060
>  -71578.76001 ipservice S513-A-IP38
>  -70287.56000 rack BA09
>  -80291.20001 rack BA10
>  -76582.40002 ipservice S513-A-IP63
>  -75291.20001 rack BA11
>  -78291.20001 rack BA12
>
> In the beginning, for reasons that are not important, we created two pools:
>   * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
>   * poolB chooses room=0513-R-0060, replicates 2x across the
> ipservices, then puts a 3rd replica in room 0513-R-0050.
>
> For clarity, here is the crush rule for poolB:
> type replicated
> min_size 1
> max_size 10
> step take 0513-R-0060
> step chooseleaf firstn 2 type ipservice
> step emit
> step take 0513-R-0050
> step chooseleaf firstn -2 type rack
> step emit
>
> Now to the puzzle.
> For reasons that are not important, we now want to change the rule for
> poolB to put all three 3 replicas in room 0513-R-0060.
> And we need to do this in a way which is totally non-disruptive
> (latency-wise) to the users of either pools. (These are both *very*
> active RBD pools).
>
> I see two obvious ways to proceed:
>   (1) simply change the rule for poolB to put a third replica on any
> osd in room 0513-R-0060. I'm afraid though that this would involve way
> too many concurrent backfills, cluster-wide, even with
> osd_max_backfills=1.
>   (2) change poolB size to 2, then change the crush rule to that from
> (1), then reset poolB size to 3. This would risk data availability
> during the time that the pool is size=2, and also risks that every osd
> in room 0513-R-0050 would be too busy deleting for some indeterminate
> time period (10s of minutes, I expect).
>
> So I would probably exclude those two approaches.
>
> Conceptually what I'd like to be able to do is a gradual migration,
> which if I may invent some syntax on the fly...
>
> Instead of
>step take 0513-R-0050
> do
>step weighted-take 99 0513-R-0050 1 0513-R-0060
>
> That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> of the time take room 0513-R-0060.
> With a mechanism like that, we could gradually adjust those "step
> weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
>
> I have a feeling that something equivalent to that is already possible
> with weight-sets or some other clever crush trickery.
> Any ideas?
>
> Best Regards,
>
> Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-27 Thread David Turner
I got pulled away from this for a while.  The error in the log is "abort:
Corruption: Snappy not supported or corrupted Snappy compressed block
contents" and the OSD has 2 settings set to snappy by default,
async_compressor_type and bluestore_compression_algorithm.  Does either of
these settings affect the omap store?

On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> Looks like you are running on CentOS, fwiw. We’ve successfully ran the
> conversion commands on Jewel, Ubuntu 16.04.
>
> Have a feel it’s expecting the compression to be enabled, can you try
> removing “compression=kNoCompression” from the filestore_rocksdb_options?
> And/or you might want to check if rocksdb is expecting snappy to be enabled.
>
> From: David Turner 
> Date: Tuesday, September 18, 2018 at 6:01 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> Here's the [1] full log from the time the OSD was started to the end of
> the crash dump.  These logs are so hard to parse.  Is there anything useful
> in them?
>
> I did confirm that all perms were set correctly and that the superblock
> was changed to rocksdb before the first time I attempted to start the OSD
> with it's new DB.  This is on a fully Luminous cluster with [2] the
> defaults you mentioned.
>
> [1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed
> [2] "filestore_omap_backend": "rocksdb",
> "filestore_rocksdb_options":
> "max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression",
>
> On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> I meant the stack trace hints that the superblock still has leveldb in it,
> have you verified that already?
>
> On 9/18/18, 5:27 PM, "Pavan Rallabhandi"  prallabha...@walmartlabs.com> wrote:
>
> You should be able to set them under the global section and that
> reminds me, since you are on Luminous already, I guess those values are
> already the default, you can verify from the admin socket of any OSD.
>
> But the stack trace didn’t hint as if the superblock on the OSD is
> still considering the omap backend to be leveldb and to do with the
> compression.
>
> Thanks,
> -Pavan.
>
> From: David Turner 
> Date: Tuesday, September 18, 2018 at 5:07 PM
> To: Pavan Rallabhandi 
> Cc: ceph-users 
> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> Are those settings fine to have be global even if not all OSDs on a
> node have rocksdb as the backend?  Or will I need to convert all OSDs on a
> node at the same time?
>
> On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi  prallabha...@walmartlabs.com> wrote:
> The steps that were outlined for conversion are correct, have you
> tried setting some the relevant ceph conf values too:
>
> filestore_rocksdb_options =
> "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"
>
> filestore_omap_backend = rocksdb
>
> Thanks,
> -Pavan.
>
> From: ceph-users  on
> behalf of David Turner 
> Date: Tuesday, September 18, 2018 at 4:09 PM
> To: ceph-users 
> Subject: EXT: [ceph-users] Any backfill in our cluster makes the
> cluster unusable and takes forever
>
> I've finally learned enough about the OSD backend track down this
> issue to what I believe is the root cause.  LevelDB compaction is the
> common thread every time we move data around our cluster.  I've ruled out
> PG subfolder splitting, EC doesn't seem to be the root cause of this, and
> it is cluster wide as opposed to specific hardware.
>
> One of the first things I found after digging into leveldb omap
> compaction was [1] this article with a heading "RocksDB instead of LevelDB"
> which mentions that leveldb was replaced with rocksdb as the default db
> backend for filestore OSDs and was even backported to Jewel because of the
> performance improvements.
>
> I figured there must be a way to be able to upgrade an OSD to use
> rocksdb from leveldb without needing to fully backfill the entire OSD.
> There is [2] this article, but you need to have an active service account
> with RedHat to access it.  I eventually came across [3] this article about
> optimizing Ceph Object Storage which mentions a resolution to OSDs flapping
> due to omap compaction to migrate to using rocksdb.  It links to the RedHat
> article, but also has [4] these steps outlined in it.  I tried to follow
> the steps, but the OSD I tested this on was unable to start with [5] this
> segfault.  And then trying to move the OSD back to

Re: [ceph-users] Mimic cluster is offline and not healing

2018-09-27 Thread by morphin
I think I might have found something.
When I start an OSD it generates high I/O, around 95%, and the other OSDs
are also triggered and generate the same I/O. This is true
even when I set the noup flag. So all the OSDs generate high I/O whenever
an OSD starts.

I think this is too much. I have 168 OSDs, and when I start them the OSD I/O
never finishes. I left the cluster alone for 70 hours and the high I/O
never finished at all.

We're trying to start the OSDs host by host and wait for them to settle, but
it takes too much time.
An OSD cannot even answer "ceph tell osd.158 version", it gets so busy, and
this seems to be a loop, since another OSD's startup triggers
I/O on the other OSDs.

So I captured debug output, and I hope it can be examined.

This is debug=20 OSD log :
Full log:  https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
Less log: Only the last part before the high I/O is finished:
https://paste.ubuntu.com/p/7ZfwH8CBC5/
Strace -f -P osd;
- When I start the osd: https://paste.ubuntu.com/p/8n2kTvwnG6/
- After I/O is finished: https://paste.ubuntu.com/p/4sGfj7Bf4c/

Now some people on IRC say this is a bug and suggest trying Ubuntu and a newer Ceph
repo; maybe it will help. I agree with them and I will give it a shot.
What do you think?
by morphin wrote on Thu, 27 Sep 2018 at 16:27:
>
> I should not have client I/O right now. All of my VMs are down right
> now. There is only a single pool.
>
> Here is my crush map: https://paste.ubuntu.com/p/Z9G5hSdqCR/
>
> Cluster does not recover. After starting OSDs with the specified
> flags, OSD up count drops from 168 to 50 with in 24 hours.
> Stefan Kooman wrote on Thu, 27 Sep 2018 at 16:10:
> >
> > Quoting by morphin (morphinwith...@gmail.com):
> > > After 72 hours I believe we may hit a bug. Any help would be greatly
> > > appreciated.
> >
> > Is it feasible for you to stop all client IO to the Ceph cluster? At
> > least until it stabilizes again. "ceph osd pause" would do the trick
> > (ceph osd unpause would unset it).
> >
> > What kind of workload are you running on the cluster? How does your
> > crush map looks like (ceph osd getcrushmap -o  /tmp/crush_raw;
> > crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?
> >
> > I have seen a (test) Ceph cluster "healing" itself to the point there was
> > nothing left to recover on. In *that* case the disks were overbooked
> > (multiple OSDs per physical disk) ... The flags you set (nooout, nodown,
> > nobackfill, norecover, noscrub, etc., etc.) helped to get it to recover
> > again. I would try to get all OSDs online again (and manually keep them
> > up / restart them, because you have set nodown).
> >
> > Does the cluster recover at all?
> >
> > Gr. Stefan
> >
> > --
> > | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> > | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-27 Thread Pavan Rallabhandi
I see FileStore symbols on the stack, so the BlueStore config doesn't come into play. 
The top frame of the stack hints at a RocksDB issue, and there are a whole 
lot of these too:

“2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

It really seems to be something with RocksDB on CentOS. I still think you can 
try removing “compression=kNoCompression” from the filestore_rocksdb_options, 
and/or check whether RocksDB expects snappy to be enabled.
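
For what it's worth, the effective values can be read straight from an OSD admin socket 
(the osd id is illustrative):

    ceph daemon osd.0 config get filestore_omap_backend
    ceph daemon osd.0 config get filestore_rocksdb_options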

Thanks,
-Pavan.

From: David Turner 
Date: Thursday, September 27, 2018 at 1:18 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I got pulled away from this for a while.  The error in the log is "abort: 
Corruption: Snappy not supported or corrupted Snappy compressed block contents" 
and the OSD has 2 settings set to snappy by default, async_compressor_type and 
bluestore_compression_algorithm.  Do either of these settings affect the omap 
store?

On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi 
 wrote:
Looks like you are running on CentOS, fwiw. We’ve successfully ran the 
conversion commands on Jewel, Ubuntu 16.04.

Have a feel it’s expecting the compression to be enabled, can you try removing 
“compression=kNoCompression” from the filestore_rocksdb_options? And/or you 
might want to check if rocksdb is expecting snappy to be enabled.

From: David Turner 
Date: Tuesday, September 18, 2018 at 6:01 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Here's the [1] full log from the time the OSD was started to the end of the 
crash dump.  These logs are so hard to parse.  Is there anything useful in them?

I did confirm that all perms were set correctly and that the superblock was 
changed to rocksdb before the first time I attempted to start the OSD with it's 
new DB.  This is on a fully Luminous cluster with [2] the defaults you 
mentioned.

[1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed
[2] "filestore_omap_backend": "rocksdb",
"filestore_rocksdb_options": 
"max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression",

On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi 
 wrote:
I meant the stack trace hints that the superblock still has leveldb in it, have 
you verified that already?

On 9/18/18, 5:27 PM, "Pavan Rallabhandi" 
 wrote:

    You should be able to set them under the global section and that reminds 
me, since you are on Luminous already, I guess those values are already the 
default, you can verify from the admin socket of any OSD.

    But the stack trace didn’t hint as if the superblock on the OSD is still 
considering the omap backend to be leveldb and to do with the compression.

    Thanks,
    -Pavan.

    From: David Turner 
    Date: Tuesday, September 18, 2018 at 5:07 PM
    To: Pavan Rallabhandi 
    Cc: ceph-users 
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    Are those settings fine to have be global even if not all OSDs on a node 
have rocksdb as the backend?  Or will I need to convert all OSDs on a node at 
the same time?

    On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi 
 wrote:
    The steps that were outlined for conversion are correct, have you tried 
setting some the relevant ceph conf values too:

    filestore_rocksdb_options = 
"max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"

    filestore_omap_backend = rocksdb

    Thanks,
    -Pavan.

    From: ceph-users 
 on behalf of 
David Turner 
    Date: Tuesday, September 18, 2018 at 4:09 PM
    To: ceph-users 
    Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

    I've finally learned enough about the OSD backend track down this issue to 
what I believe is the root cause.  LevelDB compaction is the common thread 
every time we move data around our cluster.  I've ruled out PG subfolder 
splitting, EC doesn't seem to be the root cause of this

[ceph-users] Is object name used by CRUSH algorithm?

2018-09-27 Thread Jin Mao
I am running Luminous, and the objects were copied from Isilon with a long,
similar prefix in the path, like /dir1/dir2/dir3//mm/dd. The objects are
copied to various buckets like bucket_MMDD/dir1/dir2/dir3//mm/dd.
This setup minimizes the internal code changes needed when moving from NFS to
an object store.

I heard that CRUSH may NOT balance OSDs evenly if there are many common
leading characters in the object names; however, I couldn't find any
evidence to support this.

Does anyone know further details about this?
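
One way to check empirically (pool and object names are made up, and note that for RGW 
the RADOS-level object names also carry bucket-marker prefixes):

    ceph osd map default.rgw.buckets.data "mybucket_marker_dir1/dir2/obj0001"
    ceph osd map default.rgw.buckets.data "mybucket_marker_dir1/dir2/obj0002"
    # the resulting PGs/OSD sets should look well spread even with a shared prefix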

Thank you.

Jin.


Re: [ceph-users] CRUSH puzzle: step weighted-take

2018-09-27 Thread Maged Mokhtar




On 27/09/18 17:18, Dan van der Ster wrote:

Dear Ceph friends,

I have a CRUSH data migration puzzle and wondered if someone could
think of a clever solution.

Consider an osd tree like this:

   -2   4428.02979 room 0513-R-0050
  -72911.81897 rack RA01
   -4917.27899 rack RA05
   -6917.25500 rack RA09
   -9786.23901 rack RA13
  -14895.43903 rack RA17
  -65   1161.16003 room 0513-R-0060
  -71578.76001 ipservice S513-A-IP38
  -70287.56000 rack BA09
  -80291.20001 rack BA10
  -76582.40002 ipservice S513-A-IP63
  -75291.20001 rack BA11
  -78291.20001 rack BA12

In the beginning, for reasons that are not important, we created two pools:
   * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
   * poolB chooses room=0513-R-0060, replicates 2x across the
ipservices, then puts a 3rd replica in room 0513-R-0050.

For clarity, here is the crush rule for poolB:
 type replicated
 min_size 1
 max_size 10
 step take 0513-R-0060
 step chooseleaf firstn 2 type ipservice
 step emit
 step take 0513-R-0050
 step chooseleaf firstn -2 type rack
 step emit

Now to the puzzle.
For reasons that are not important, we now want to change the rule for
poolB to put all three 3 replicas in room 0513-R-0060.
And we need to do this in a way which is totally non-disruptive
(latency-wise) to the users of either pools. (These are both *very*
active RBD pools).

I see two obvious ways to proceed:
   (1) simply change the rule for poolB to put a third replica on any
osd in room 0513-R-0060. I'm afraid though that this would involve way
too many concurrent backfills, cluster-wide, even with
osd_max_backfills=1.
   (2) change poolB size to 2, then change the crush rule to that from
(1), then reset poolB size to 3. This would risk data availability
during the time that the pool is size=2, and also risks that every osd
in room 0513-R-0050 would be too busy deleting for some indeterminate
time period (10s of minutes, I expect).

So I would probably exclude those two approaches.

Conceptually what I'd like to be able to do is a gradual migration,
which if I may invent some syntax on the fly...

Instead of
step take 0513-R-0050
do
step weighted-take 99 0513-R-0050 1 0513-R-0060

That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
of the time take room 0513-R-0060.
With a mechanism like that, we could gradually adjust those "step
weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.

I have a feeling that something equivalent to that is already possible
with weight-sets or some other clever crush trickery.
Any ideas?

Best Regards,

Dan

Would it be possible in your case to create a parent datacenter bucket
to hold both rooms and assign their relative weights there, and then for the
third replica do a step take to this parent bucket? It's not elegant, but it
may do the trick.
The suggested step weighted-take would be more flexible, as it could be
changed per replica, but I do not know if you can do that with the
existing code.
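
In command form, the parent-bucket idea would look roughly like this (the bucket 
name is made up, and moving buckets needs the same care regarding data movement):

    ceph osd crush add-bucket dc-0513 datacenter
    ceph osd crush move 0513-R-0050 datacenter=dc-0513
    ceph osd crush move 0513-R-0060 datacenter=dc-0513
    # then, for the third replica in the rule:
    #   step take dc-0513
    #   step chooseleaf firstn -2 type rack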


Maged




Re: [ceph-users] Mimic cluster is offline and not healing

2018-09-27 Thread by morphin
Good news... :)

After I had tried everything, I decided to re-create my MONs from the OSDs,
using this script:
https://paste.ubuntu.com/p/rNMPdMPhT5/

And it worked!!!
I think that when the 2 servers crashed and came back at the same time, the MONs
somehow got confused and the maps were simply corrupted.
After re-creation all the MONs had the same map, so it worked.
But I still don't know how on earth the MONs could cause endless 95% I/O.
This is a bug anyway, and if you don't want to be left with the problem, do
not "enable" your MONs; just start them manually! Another tough lesson.

ceph -s: https://paste.ubuntu.com/p/m3hFF22jM9/

As you can see below, some of the OSDs are still down, and when I try to start
them they don't come up.
Check the start log: https://paste.ubuntu.com/p/ZJQG4khdbx/
Debug log: https://paste.ubuntu.com/p/J3JyGShHym/

What can we do about the problem?
What is the cause of the problem?

Thank you everyone. You helped me a lot! :)
>
> I think I might find something.
> When I start an OSD its making High I/O  around %95 and the other OSDs
> are also triggered and altogether they make same the I/O. This is true
> even if when I set noup flag. So all the OSDs are making high I/O when
> ever an OSD starts.
>
> I think this is too much. I have 168 OSD and when I start them OSD I/O
> job never finishes. I let the cluster for 70 hours and the high I/O
> never finished at all.
>
> We're trying to start OSD's host by host and wait for settlement but
> it takes too much time.
> OSD can not even answer "ceph tell osd.158 version". So if it becomes
> so busy and this seems to be a loop since another OSD startup triggers
> other OSD I/O.
>
> So I debug it and I hope this can be examined.
>
> This is debug=20 OSD log :
> Full log:  https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> Less log: Only the last part before the high I/O is finished:
> https://paste.ubuntu.com/p/7ZfwH8CBC5/
> Strace -f -P osd;
> - When I start the osd: https://paste.ubuntu.com/p/8n2kTvwnG6/
> - After I/O is finished: https://paste.ubuntu.com/p/4sGfj7Bf4c/
>
> Now some people in IRC says this is a bug, try Ubuntu and new Ceph
> repo maybe it will help. I agree with them and I will give a shot.
> What do you think?
> by morphin wrote on Thu, 27 Sep 2018 at 16:27:
> >
> > I should not have client I/O right now. All of my VMs are down right
> > now. There is only a single pool.
> >
> > Here is my crush map: https://paste.ubuntu.com/p/Z9G5hSdqCR/
> >
> > Cluster does not recover. After starting OSDs with the specified
> > flags, OSD up count drops from 168 to 50 with in 24 hours.
> > Stefan Kooman wrote on Thu, 27 Sep 2018 at 16:10:
> > >
> > > Quoting by morphin (morphinwith...@gmail.com):
> > > > After 72 hours I believe we may hit a bug. Any help would be greatly
> > > > appreciated.
> > >
> > > Is it feasible for you to stop all client IO to the Ceph cluster? At
> > > least until it stabilizes again. "ceph osd pause" would do the trick
> > > (ceph osd unpause would unset it).
> > >
> > > What kind of workload are you running on the cluster? How does your
> > > crush map looks like (ceph osd getcrushmap -o  /tmp/crush_raw;
> > > crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?
> > >
> > > I have seen a (test) Ceph cluster "healing" itself to the point there was
> > > nothing left to recover on. In *that* case the disks were overbooked
> > > (multiple OSDs per physical disk) ... The flags you set (nooout, nodown,
> > > nobackfill, norecover, noscrub, etc., etc.) helped to get it to recover
> > > again. I would try to get all OSDs online again (and manually keep them
> > > up / restart them, because you have set nodown).
> > >
> > > Does the cluster recover at all?
> > >
> > > Gr. Stefan
> > >
> > > --
> > > | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> > > | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] ceph-ansible

2018-09-27 Thread solarflow99
Thanks guys, installing this package did the trick, it works now.



On Mon, Sep 24, 2018 at 8:39 AM Ken Dreyer  wrote:

> Hi Alfredo,
>
> I've packaged the latest version in Fedora, but I didn't update EPEL.
> I've submitted the update for EPEL now at
> https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2018-7f8d3be3e2 .
> solarflow99, you can test this package and report "+1" in Bodhi there.
>
> It's also in the CentOS Storage SIG
> (http://cbs.centos.org/koji/buildinfo?buildID=23004) . Today I've
> tagged that build in CBS into storage7-ceph-luminous-testing and
> storage7-ceph-mimic-testing, so it will show up at
> https://buildlogs.centos.org/centos/7/storage/x86_64/ceph-luminous/
> soon. solarflow99, you could test this as well (although CentOS does
> not have a feedback mechanism like Fedora's Bodhi yet)
> On Fri, Sep 21, 2018 at 4:43 AM Alfredo Deza  wrote:
> >
> > On Thu, Sep 20, 2018 at 7:04 PM solarflow99 
> wrote:
> > >
> > > oh, was that all it was...  git clone
> https://github.com/ceph/ceph-ansible/
> > > I installed the notario  package from EPEL,
> python2-notario-0.0.11-2.el7.noarch  and thats the newest they have
> >
> > Hey Ken, I thought the latest versions were being packaged, is there
> > something I've missed? The tags have changed format it seems, from
> > 0.0.11
> > >
> > >
> > >
> > >
> > > On Thu, Sep 20, 2018 at 3:57 PM Alfredo Deza  wrote:
> > >>
> > >> Not sure how you installed ceph-ansible, the requirements mention a
> > >> version of a dependency (the notario module) which needs to be 0.0.13
> > >> or newer, and you seem to be using an older one.
> > >>
> > >>
> > >> On Thu, Sep 20, 2018 at 6:53 PM solarflow99 
> wrote:
> > >> >
> > >> > Hi, tying to get this to do a simple deployment, and i'm getting a
> strange error, has anyone seen this?  I'm using Centos 7, rel 5   ansible
> 2.5.3  python version = 2.7.5
> > >> >
> > >> > I've tried with mimic luninous and even jewel, no luck at all.
> > >> >
> > >> >
> > >> >
> > >> > TASK [ceph-validate : validate provided configuration]
> **
> > >> > task path:
> /home/jzygmont/ansible/ceph-ansible/roles/ceph-validate/tasks/main.yml:2
> > >> > Thursday 20 September 2018  14:05:18 -0700 (0:00:05.734)
>  0:00:37.439 
> > >> > The full traceback is:
> > >> > Traceback (most recent call last):
> > >> >   File
> "/usr/lib/python2.7/site-packages/ansible/executor/task_executor.py", line
> 138, in run
> > >> > res = self._execute()
> > >> >   File
> "/usr/lib/python2.7/site-packages/ansible/executor/task_executor.py", line
> 561, in _execute
> > >> > result = self._handler.run(task_vars=variables)
> > >> >   File
> "/home/jzygmont/ansible/ceph-ansible/plugins/actions/validate.py", line 43,
> in run
> > >> > notario.validate(host_vars, install_options, defined_keys=True)
> > >> > TypeError: validate() got an unexpected keyword argument
> 'defined_keys'
> > >> >
> > >> > fatal: [172.20.3.178]: FAILED! => {
> > >> > "msg": "Unexpected failure during module execution.",
> > >> > "stdout": ""
> > >> > }
> > >> >
> > >> > NO MORE HOSTS LEFT
> **
> > >> >
> > >> > PLAY RECAP
> **
> > >> > 172.20.3.178   : ok=25   changed=0unreachable=0
> failed=1
> > >> >
> > >> > ___
> > >> > ceph-users mailing list
> > >> > ceph-users@lists.ceph.com
> > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] CRUSH puzzle: step weighted-take

2018-09-27 Thread Goncalo Borges
Hi Dan,

Hope this finds you well.

Here goes a suggestion from someone who has been sitting on the sidelines
for the last 2 years but following things as much as possible.

Would a weight set per pool help?

This is only possible in Luminous, but according to the docs there is the
possibility of adjusting the positional weights for the devices hosting the
replicas of objects for a given bucket.
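
For reference, the per-pool weight-set commands in Luminous look roughly like this
(the pool name, device and weights are purely illustrative):

    ceph osd crush weight-set create poolB positional
    # one weight per replica position; e.g. keep osd.12 attractive for the first two
    # positions but less so for the third:
    ceph osd crush weight-set reweight poolB osd.12 1.0 1.0 0.2
    ceph osd crush weight-set ls
    ceph osd crush weight-set dump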

Cheers
Goncalo