Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-10 Thread Alexandre DERUMIER
>>Regarding rbd cache, is something I will try -today I was thinking about it- 
>>but I did not try it yet because I don't want to reduce write speed.

Note that rbd_cache only works for sequential writes, so it doesn't help with 
random writes.

Also, internally, qemu forces aio=threads when cache=writeback is enabled, but 
it can use aio=native with cache=none.
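
For example, a qemu command line using native AIO and no cache would look 
roughly like this (the iothread object, drive ids and the rbd path below are 
only illustrative, not taken from your setup):

-object iothread,id=iothread0 \
-drive file=rbd:rbd/vm-disk:id=admin,format=raw,if=none,id=drive0,cache=none,aio=native \
-device virtio-blk-pci,drive=drive0,iothread=iothread0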



- Mail original -
De: "Xavier Trilla" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Vendredi 10 Mars 2017 14:12:59
Objet: Re: [ceph-users] Posix AIO vs libaio read performance

Hi Alexandre, 

Debugging is disabled in client and osds. 

Regarding rbd cache, is something I will try -today I was thinking about it- 
but I did not try it yet because I don't want to reduce write speed. 

I also tried iothreads, but no benefit. 

I tried as well with virtio-blk and virtio-scsi; there is a small improvement 
with virtio-blk, but it's only around 10%. 

This is becoming quite a strange issue, as it only affects POSIX AIO read 
performance. Nothing else seems to be affected -although POSIX AIO write isn't 
anywhere near libaio performance either-. 

Thanks for your help; if you have any other ideas they will be really 
appreciated. 

Also if somebody could run in their cluster from inside a VM the following 
command: 



fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32 



It would be really helpful to know if I'm the only one affected or this is 
happening in all qemu + ceph setups. 
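
For comparison, the matching libaio job (the same command with only the engine 
changed) is:

fio --name=randread-libaio --output ./test-libaio --runtime 60 --ioengine=libaio \
    --buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32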

Thanks! 
Xavier 

El 10 mar 2017, a las 8:07, Alexandre DERUMIER < [ mailto:aderum...@odiso.com | 
aderum...@odiso.com ] > escribió: 


>>But it still looks like there is some bottleneck in QEMU or librbd that I 
>>cannot manage to find.

You can improve latency on the client by disabling debug logging. 

on your client, create a /etc/ceph/ceph.conf with 

[global] 
debug asok = 0/0 
debug auth = 0/0 
debug buffer = 0/0 
debug client = 0/0 
debug context = 0/0 
debug crush = 0/0 
debug filer = 0/0 
debug filestore = 0/0 
debug finisher = 0/0 
debug heartbeatmap = 0/0 
debug journal = 0/0 
debug journaler = 0/0 
debug lockdep = 0/0 
debug mds = 0/0 
debug mds balancer = 0/0 
debug mds locker = 0/0 
debug mds log = 0/0 
debug mds log expire = 0/0 
debug mds migrator = 0/0 
debug mon = 0/0 
debug monc = 0/0 
debug ms = 0/0 
debug objclass = 0/0 
debug objectcacher = 0/0 
debug objecter = 0/0 
debug optracker = 0/0 
debug osd = 0/0 
debug paxos = 0/0 
debug perfcounter = 0/0 
debug rados = 0/0 
debug rbd = 0/0 
debug rgw = 0/0 
debug throttle = 0/0 
debug timer = 0/0 
debug tp = 0/0 


You can also disable the rbd cache (rbd_cache=false), or in qemu set cache=none. 

Using an iothread on the qemu drive should help a little bit too. 

- Mail original - 
De: "Xavier Trilla" < [ mailto:xavier.tri...@silicontower.net | 
xavier.tri...@silicontower.net ] > 
À: "ceph-users" < [ mailto:ceph-users@lists.ceph.com | 
ceph-users@lists.ceph.com ] > 
Envoyé: Vendredi 10 Mars 2017 05:37:01 
Objet: Re: [ceph-users] Posix AIO vs libaio read performance 



Hi, 



We compiled Hammer .10 to use jemalloc and now the cluster performance improved 
a lot, but POSIX AIO operations are still quite slower than libaio. 



Now with a single thread read operations are about 1000 per second and write 
operations about 5000 per second. 



Using same FIO configuration, but libaio read operations are about 15K per 
second and writes 12K per second. 



I’m compiling QEMU with jemalloc support as well, and I’m planning to replace 
librbd on the QEMU hosts with the new build using jemalloc. 



But it still looks like there is some bottleneck in QEMU or librbd that I cannot 
manage to find. 



Any help will be much appreciated. 



Thanks. 






De: ceph-users [ [ mailto:ceph-users-boun...@lists.ceph.com | 
mailto:ceph-users-boun...@lists.ceph.com ] ] En nombre de Xavier Trilla 
Enviado el: jueves, 9 de marzo de 2017 6:56 
Para: [ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ] 
Asunto: [ceph-users] Posix AIO vs libaio read performance 




Hi, 



I’m trying to debug why there is a big difference between POSIX AIO and libaio 
when performing read tests from inside a VM using librbd. 



The results I’m getting using FIO are: 



POSIX AIO Read: 



Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 2.54 MB/s 

Average: 632 IOPS 



Libaio Read: 



Type: Random Read - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 147.88 MB/s 

Average: 36967 IOPS 



When performing writes the differences aren’t so big, because the cluster 
–which is in production right now- is CPU bound: 



POSIX AIO Write: 



Type: Random Write - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 14.87 MB/s 

Average: 3713 IOPS 



Libaio Write: 



Type: Random Write - 

[ceph-users] Latest Jewel New OSD Creation

2017-03-10 Thread Ashley Merrick
Hello,

I am trying to add a new OSD to my Ceph cluster. I am running Proxmox, so I 
attempted to add it via the GUI as normal, but received an error output at the 
following command:

ceph-disk prepare --zap-disk --fs-type xfs --cluster ceph --cluster-uuid 
51c1b5c5-e510-4ed3-8b09-417214edb3f4 --journal-dev /dev/sdc /dev/sdm1

Output : ceph-disk: Error: journal specified but not allowed by osd backend

This has only been happening since updating to v10.2.6; it looks like ceph-disk 
for some reason thinks the OSD should be a BlueStore OSD?

Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with RDMA

2017-03-10 Thread Haomai Wang
On Sat, Mar 11, 2017 at 10:29 AM, PR PR  wrote:
> Thanks for the quick reply. I tried it with master as well. Followed
> instructions on this link - https://community.mellanox.com/docs/DOC-2721
>
> Ceph mon fails to start with error "unrecognized ms_type 'async+rdma'"

Then ceph-mon must not have been compiled with RDMA support.
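
You can quickly check whether the binary was built against ibverbs; if it was, 
the Mellanox guide's settings boil down to a ceph.conf fragment roughly like the 
sketch below (the device name is an assumption; check ibv_devices on your host, 
and note these option names only exist in RDMA-enabled builds):

# check whether ceph-mon links against the RDMA verbs library
ldd $(which ceph-mon) | grep -i ibverbs

# minimal sketch of the messenger settings from the guide
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0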

>
> Appreciate any pointers.
>
> On Thu, Mar 9, 2017 at 5:56 PM, Haomai Wang  wrote:
>>
>> On Fri, Mar 10, 2017 at 4:28 AM, PR PR  wrote:
>> > Hi,
>> >
>> > I am trying to use ceph with RDMA. I have a few questions.
>> >
>> > 1. Is there a prebuilt package that has rdma support or the only way to
>> > try
>> > ceph+rdma is to checkout from github and compile from scratch?
>> >
>> > 2. Looks like there are two ways of using rdma - xio and async+rdma.
>> > Which
>> > is the recommended approach? Also, any insights on the differences will
>> > be
>> > useful as well.
>> >
>> > 3. async+rdma seems to have lot of recent changes. Is 11.2.0 expected to
>> > work for async+rdma? As when I compiled 11.2.0 it fails with following
>> > error
>> >
>>
>> suggest checkout with master
>>
>> > [ 81%] Built target rbd
>> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
>> > `ibv_free_device_list'
>> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
>> > `ibv_get_cq_event'
>> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
>> > `ibv_alloc_pd'
>> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
>> > `ibv_close_device'
>> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
>> > `ibv_destroy_qp'
>> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
>> > `ibv_modify_qp'
>> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
>> > `ibv_get_async_event'
>> > ***snipped***
>> > Link Error: Ceph FS library not found
>> > src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/build.make:57: recipe for
>> > target 'src/pybind/cephfs/CMakeFiles/cython_cephfs' failed
>> > make[2]: *** [src/pybind/cephfs/CMakeFiles/cython_cephfs] Error 1
>> > CMakeFiles/Makefile2:4015: recipe for target
>> > 'src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/all' failed
>> > make[1]: *** [src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/all] Error
>> > 2
>> > make[1]: *** Waiting for unfinished jobs
>> > [ 85%] Built target rgw_a
>> >
>> > Thanks,
>> > PR
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with RDMA

2017-03-10 Thread PR PR
Thanks for the quick reply. I tried it with master as well. Followed
instructions on this link - https://community.mellanox.com/docs/DOC-2721

Ceph mon fails to start with error "unrecognized ms_type 'async+rdma'"

Appreciate any pointers.

On Thu, Mar 9, 2017 at 5:56 PM, Haomai Wang  wrote:

> On Fri, Mar 10, 2017 at 4:28 AM, PR PR  wrote:
> > Hi,
> >
> > I am trying to use ceph with RDMA. I have a few questions.
> >
> > 1. Is there a prebuilt package that has rdma support or the only way to
> try
> > ceph+rdma is to checkout from github and compile from scratch?
> >
> > 2. Looks like there are two ways of using rdma - xio and async+rdma.
> Which
> > is the recommended approach? Also, any insights on the differences will
> be
> > useful as well.
> >
> > 3. async+rdma seems to have lot of recent changes. Is 11.2.0 expected to
> > work for async+rdma? As when I compiled 11.2.0 it fails with following
> error
> >
>
> suggest checkout with master
>
> > [ 81%] Built target rbd
> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> > `ibv_free_device_list'
> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> > `ibv_get_cq_event'
> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> > `ibv_alloc_pd'
> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> > `ibv_close_device'
> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> > `ibv_destroy_qp'
> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> > `ibv_modify_qp'
> > /mnt/ceph_compile/ceph/build/lib/libcephfs.so: undefined reference to
> > `ibv_get_async_event'
> > ***snipped***
> > Link Error: Ceph FS library not found
> > src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/build.make:57: recipe for
> > target 'src/pybind/cephfs/CMakeFiles/cython_cephfs' failed
> > make[2]: *** [src/pybind/cephfs/CMakeFiles/cython_cephfs] Error 1
> > CMakeFiles/Makefile2:4015: recipe for target
> > 'src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/all' failed
> > make[1]: *** [src/pybind/cephfs/CMakeFiles/cython_cephfs.dir/all] Error
> 2
> > make[1]: *** Waiting for unfinished jobs
> > [ 85%] Built target rgw_a
> >
> > Thanks,
> > PR
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Brad Hubbard
So this is why it happened I guess.

pool 3 'volumes' replicated size 3 min_size 1

min_size = 1 is a recipe for disasters like this and there are plenty
of ML threads about not setting it below 2.
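
Once the cluster is healthy again, raising it back is a one-liner (pool name 
taken from the dump above):

ceph osd pool set volumes min_size 2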

The past intervals in the pg query show several intervals where a
single OSD may have gone rw.

How important is this data?

I would suggest checking which of these OSDs actually have the data
for this pg. From the pg query it looks like 2, 35 and 68 and possibly
28 since it's the primary. Check all OSDs in the pg query output. I
would then back up all copies and work out which copy, if any, you
want to keep and then attempt something like the following.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17820.html
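
As a rough sketch of the backup step (the OSD id, paths and output file below 
are illustrative; stop the OSD before exporting):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
    --journal-path /var/lib/ceph/osd/ceph-2/journal \
    --op export --pgid 3.367 --file /backup/pg3.367-osd2.export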

If you want to abandon the pg see
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
for a possible solution.

http://ceph.com/community/incomplete-pgs-oh-my/ may also give some ideas.


On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai  wrote:
> The OSDs are all there.
>
> $ sudo ceph osd stat
>  osdmap e60609: 72 osds: 72 up, 72 in
>
> an I have attached the result of ceph osd tree, and ceph osd dump commands.
> I got some extra info about the network problem. A faulty network device has
> flooded the network eating up all the bandwidth so the OSDs were not able to
> properly communicate with each other. This has lasted for almost 1 day.
>
> Thank you,
> Laszlo
>
>
>
> On 10.03.2017 12:19, Brad Hubbard wrote:
>>
>> To me it looks like someone may have done an "rm" on these OSDs but
>> not removed them from the crushmap. This does not happen
>> automatically.
>>
>> Do these OSDs show up in "ceph osd tree" and "ceph osd dump" ? If so,
>> paste the output.
>>
>> Without knowing what exactly happened here it may be difficult to work
>> out how to proceed.
>>
>> In order to go clean the primary needs to communicate with multiple
>> OSDs, some of which are marked DNE and seem to be uncontactable.
>>
>> This seems to be more than a network issue (unless the outage is still
>> happening).
>>
>>
>> http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete
>>
>>
>>
>> On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai 
>> wrote:
>>>
>>> Hello,
>>>
>>> I was informed that due to a networking issue the ceph cluster network
>>> was
>>> affected. There was a huge packet loss, and network interfaces were
>>> flipping. That's all I got.
>>> This outage has lasted a longer period of time. So I assume that some OSD
>>> may have been considered dead and the data from them has been moved away
>>> to
>>> other PGs (this is what ceph is supposed to do if I'm correct). Probably
>>> that was the point when the listed PGs have appeared into the picture.
>>> From the query we can see this for one of those OSDs:
>>> {
>>> "peer": "14",
>>> "pgid": "3.367",
>>> "last_update": "0'0",
>>> "last_complete": "0'0",
>>> "log_tail": "0'0",
>>> "last_user_version": 0,
>>> "last_backfill": "MAX",
>>> "purged_snaps": "[]",
>>> "history": {
>>> "epoch_created": 4,
>>> "last_epoch_started": 54899,
>>> "last_epoch_clean": 55143,
>>> "last_epoch_split": 0,
>>> "same_up_since": 60603,
>>> "same_interval_since": 60603,
>>> "same_primary_since": 60593,
>>> "last_scrub": "2852'33528",
>>> "last_scrub_stamp": "2017-02-26 02:36:55.210150",
>>> "last_deep_scrub": "2852'16480",
>>> "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
>>> "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
>>> },
>>> "stats": {
>>> "version": "0'0",
>>> "reported_seq": "14",
>>> "reported_epoch": "59779",
>>> "state": "down+peering",
>>> "last_fresh": "2017-02-27 16:30:16.230519",
>>> "last_change": "2017-02-27 16:30:15.267995",
>>> "last_active": "0.00",
>>> "last_peered": "0.00",
>>> "last_clean": "0.00",
>>> "last_became_active": "0.00",
>>> "last_became_peered": "0.00",
>>> "last_unstale": "2017-02-27 16:30:16.230519",
>>> "last_undegraded": "2017-02-27 16:30:16.230519",
>>> "last_fullsized": "2017-02-27 16:30:16.230519",
>>> "mapping_epoch": 60601,
>>> "log_start": "0'0",
>>> "ondisk_log_start": "0'0",
>>> "created": 4,
>>> "last_epoch_clean": 55143,
>>> "parent": "0.0",
>>> "parent_split_bits": 0,
>>> "last_scrub": "2852'33528",
>>> "last_scrub_stamp": "2017-02-26 02:36:55.210150",
>>>

[ceph-users] http://www.dell.com/support/home/us/en/04/product-support/servicetag/JFGQY02/warranty#

2017-03-10 Thread Anthony D'Atri
> As long as you don’t nuke the OSDs or the journals, you should be OK.

This.  Most HBA failures I’ve experienced don’t corrupt data on the drives, but 
it can happen.

Assuming the data is okay, you should be able to just install the OS, install 
the *same version* of Ceph packages, reboot, and have them come up and in (and 
backfill / recover with a vengeance)


— aad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to boot OS on cluster node

2017-03-10 Thread Lincoln Bryant
Hi Shain,

As long as you don’t nuke the OSDs or the journals, you should be OK. I think 
the keyring and such are typically stored on the OSD itself. If you have lost 
track of what physical device maps to what OSD, you can always mount the OSDs 
in a temporary spot and cat the “whoami” file.
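
Something like this, with the partition name as a placeholder:

mkdir -p /mnt/osd-tmp
mount -o ro /dev/sdX1 /mnt/osd-tmp    # sdX1 = the OSD's data partition
cat /mnt/osd-tmp/whoami               # prints the OSD id, e.g. 12
umount /mnt/osd-tmp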

—Lincoln

> On Mar 10, 2017, at 11:33 AM, Shain Miley  wrote:
> 
> Hello,
> 
> We had an issue with one of our Dell 720xd servers and now the raid card 
> cannot seem to boot from the Ubuntu OS drive volume.
> 
> I would like to know...if I reload the OS...is there an easy way to get the 
> 12 OSD's disks back into the cluster without just having to remove them from 
> the cluster, wipe the drives and then re-add them?
> 
> Right now I have the 'noout' and 'nodown' flags set on the cluster so there 
> has been no data movement yet as a result of this node being down.
> 
> Thanks in advance for any help.
> 
> Shain
> 
> 
> -- 
> NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org 
> | 202.513.3649
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to boot OS on cluster node

2017-03-10 Thread Xavier Trilla
Hi Shain,

Not talking from experience, but as far as I know -from how ceph works- I guess 
it is enough if you reinstall the system, install ceph again, add ceph.conf and 
the keys, and udev will do the rest. Maybe you'll need to restart the server 
after you've done everything, but ceph should find the OSDs itself.

As far as I know OSDs are detected by udev -at least in Ubuntu you don't have 
entries for them in fstab- and started. But maybe you'll have to start them 
manually.

But again, I've never done that, and I'm just talking from what I've seen of how 
ceph works. (Also, if using a different linux release it may work differently.)
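
If they don't come up on their own, a hedged sketch of starting them manually 
might look like this (which command applies depends on the release and on 
whether the box uses upstart or systemd):

ceph-disk activate-all         # re-activate all prepared OSD partitions
start ceph-osd id=12           # upstart (Ubuntu 14.04)
systemctl start ceph-osd@12    # systemd (Ubuntu 16.04 and later)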

Good luck!

Xavier.


-Mensaje original-
De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de Shain 
Miley
Enviado el: viernes, 10 de marzo de 2017 18:35
Para: ceph-users 
Asunto: [ceph-users] Unable to boot OS on cluster node

Hello,

We had an issue with one of our Dell 720xd servers and now the raid card cannot 
seem to boot from the Ubuntu OS drive volume.

I would like to know...if I reload the OS...is there an easy way to get the 12 
OSD's disks back into the cluster without just having to remove them from the 
cluster, wipe the drives and then re-add them?

Right now I have the 'noout' and 'nodown' flags set on the cluster so there has 
been no data movement yet as a result of this node being down.

Thanks in advance for any help.

Shain


--
NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org | 
202.513.3649

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-10 Thread Xavier Trilla
Hi Jason,


Just to add more information: 

- The issue doesn't seem to be fio or glibc (guest) related, as both work 
properly in other environments using the same software versions. Also, I've 
tried Ubuntu 14.04 and 16.04 and I'm getting really similar results, but I'll 
run more tests just to be 100% sure.
- If I increase the number of concurrent jobs in fio (e.g. 16) results are much 
better (they get above 10k IOPS; a multi-job sketch follows this list).
- I'm seeing similarly bad results when using KRBD, but I still need to run more 
tests on this front (I'm using KRBD from inside a VM, because in our 
infrastructure getting your hands on a test physical machine is quite 
difficult, but I'll manage. The VM has a 10G connection, and I'm mounting the 
RBD volume from inside the VM using the kernel module -4.4- so the result 
should give an idea of how KRBD will perform).
- I'm not seeing improvements with librbd compiled with jemalloc support.
- No difference between QEMU 2.0, 2.5 or 2.7.
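
For reference, the multi-job run mentioned in the second point is essentially 
the single-job command further down with numjobs added, e.g.:

fio --name=randread-posix-16j --runtime 60 --ioengine=posixaio --buffered=0 \
    --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32 --numjobs=16 \
    --group_reporting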

It looks like it's related to an interaction between how POSIX AIO handles 
direct reads and how Ceph works -but it could also be KVM related-. I could 
argue it's related to Ceph being networked storage, but in other environments 
like Amazon EBS I'm not seeing this issue, although obviously I don't have any 
idea about EBS internals (but I guess that's what we are trying to match... if 
it works properly on EBS it should work properly on Ceph too ;) Also, I'm still 
trying to verify whether this is just related to my setup or affects all Ceph 
installations. 

One of the things I find most strange is the performance difference on the read 
side. libaio performance is way better for both reads and writes, but the 
biggest gap is between POSIX AIO reads and libaio reads.

BTW: Do you have a test environment where you could test fio using POSIX AIO? 
I've been running tests on our production and test clusters, but they run almost 
the same version (hammer) of everything :/ Maybe I'll try to deploy a new 
cluster using jewel -if I can get my hands on enough hardware-. Here are the 
command lines for FIO:

POSIX AIO:
fio --name=randread-posix --runtime 60 --ioengine=posixaio --buffered=0 
--direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32

Libaio:

fio --name=randread-libaio --runtime 60 --ioengine=libaio --buffered=0 
--direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32

Also thanks for the blktrace tip, on Monday I'll start playing with it and I'll 
post my findings.

Thanks!
Xavier

-Mensaje original-
De: Jason Dillaman [mailto:jdill...@redhat.com] 
Enviado el: viernes, 10 de marzo de 2017 19:18
Para: Xavier Trilla 
CC: Alexandre DERUMIER ; ceph-users 

Asunto: Re: [ceph-users] Posix AIO vs libaio read performance

librbd doesn't know that you are using libaio vs POSIX AIO. Therefore, the best 
bet is that the issue is in fio or glibc. As a first step, I would recommend 
using blktrace (or similar) within your VM to determine if there is a delta 
between libaio and POSIX AIO at the block level.

On Fri, Mar 10, 2017 at 12:28 PM, Xavier Trilla 
 wrote:
> I disabled rbd cache but no improvement, just a huge performance drop 
> in writes (Which proves the cache was properly disabled).
>
>
>
> Now I’m working on two other fronts:
>
>
>
> -Using librbd with jemalloc in the Hypervisors (Hammer .10)
>
> -Compiling QEMU with jemalloc (QEMU 2.6)
>
> -Running some tests from a Bare Metal server using FIO tool, but it
> will use the librbd directly so no way to simulate POSIX AIO (Maybe 
> I’ll try via KRBD)
>
>
>
> I’m quite sure is something on the client side, but I don’t know 
> enough about the Ceph internals to totally discard the issue being related to 
> OSDs.
> But so far performance of the OSDs is really good using other test 
> engines, so I’m working more on the client side.
>
>
>
> Any help or information would be really welcome J
>
>
>
> Thanks.
>
> Xavier.
>
>
>
> De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de 
> Xavier Trilla Enviado el: viernes, 10 de marzo de 2017 14:13
> Para: Alexandre DERUMIER 
> CC: ceph-users 
> Asunto: Re: [ceph-users] Posix AIO vs libaio read performance
>
>
>
> Hi Alexandre,
>
>
>
> Debugging is disabled in client and osds.
>
>
>
> Regarding rbd cache, is something I will try -today I was thinking 
> about it- but I did not try it yet because I don't want to reduce write speed.
>
>
>
> I also tried iothreads, but no benefit.
>
>
>
> I tried as well with virtio-blk and virtio-scsi, there is a small 
> improvement with virtio-blk, but it's around a 10%.
>
>
>
> This is becoming a quite strange issue, as it only affects posix aio 
> read performance. Nothing less seems to be affected -although posix 
> aio write isn't nowhere near libaio performance-.
>
>
>
> Thanks for you help, if you have any other ideas they will be really 
> appreciated.
>
>
>
> Also if somebody could run in their cluster from inside a VM the 
> following
> command:
>
>
>
> 

Re: [ceph-users] 答复: How does ceph preserve read/write consistency?

2017-03-10 Thread Gregory Farnum
On Thu, Mar 9, 2017 at 7:20 PM 许雪寒  wrote:

> Thanks for your reply.
>
> As the log shows, in our test, a READ that come after a WRITE did finished
> before that WRITE.


This is where you've gone astray. Any storage system is perfectly free to
reorder simultaneous requests -- defined as those whose submit-reply time
overlaps. So you submitted write W, then submitted read R, then got a
response to R before W. That's allowed, and preventing it is actually
impossible in general. In the specific case you've outlined, we *could* try
to prevent it, but doing so is pretty ludicrously expensive and, since the
"reorder" can happen anyway, doesn't provide any benefit.
So we don't try. :)

That said, obviously we *do* provide strict ordering across write
boundaries: a read submitted after a write completed will always see the
results of that write.
-Greg

And I read the source code, it seems that, for writes, in
> ReplicatedPG::do_op method, the thread in OSD_op_tp calls
> ReplicatedPG::get_rw_lock method which tries to get RWState::RWWRITE. If it
> fails, the op will be put into obc->rwstate.waiters queue and be requeued
> when repop finishes, however, the OSD_op_tp's thread doesn't wait for repop
> and tries to get the next OP. Can this be the cause?
>
> -Original Message-
> From: Wei Jin [mailto:wjin...@gmail.com]
> Sent: March 9, 2017 21:52
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How does ceph preserve read/write consistency?
>
> On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒  wrote:
> > Hi, everyone.
>
> > As shown above, WRITE req with tid 1312595 arrived at 18:58:27.439107
> and READ req with tid 6476 arrived at 18:59:55.030936, however, the latter
> finished at 19:00:20:89 while the former finished commit at
> 19:00:20.335061 and filestore write at 19:00:25.202321. And in these logs,
> we found that between the start and finish of each req, there was a lot of
> "dequeue_op" of that req. We read the source code, it seems that this is
> due to "RWState", is that correct?
> >
> > And also, it seems that OSD won't distinguish reqs from different
> clients, so is it possible that io reqs from the same client also finish in
> a different order than that they were created in? Could this affect the
> read/write consistency? For instance, that a read can't acquire the data
> that were written by the same client just before it.
> >
>
> IMO, that doesn't make sense for rados to distinguish reqs from different
> clients.
> Clients or Users should do it by themselves.
>
> However, as for one specific client, ceph can and must guarantee the
> request order.
>
> 1) ceph messenger (network layer) has in_seq and out_seq when receiving
> and sending message
>
> 2) message will be dispatched or fast dispatched and then be queued in
> ShardedOpWq in order.
>
> If requests belong to different pgs, they may be processed concurrently,
> that's ok.
>
> If requests belong to the same pg, they will be queued in the same shard
> and will be processed in order due to pg lock (both read and write).
> For continuous write, op will be queued in ObjectStore in order due to pg
> lock and ObjectStore has OpSequence to guarantee the order when applying op
> to page cache, that's ok.
>
> With regard to  'read after write' to the same object, ceph must guarantee
> read can get the correct write content. That's done by
> ondisk_read/write_lock in ObjectContext.
>
>
> > We are testing hammer version, 0.94.5.  Please help us, thank you:-)
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-10 Thread Gregory Farnum
On Tue, Mar 7, 2017 at 10:18 AM Alejandro Comisario 
wrote:

> Gregory, thanks for the response; what you've said is by far the most
> enlightening thing I've learned about ceph in a long time.
>
> What raises even greater doubt is that this "non-functional" pool was
> only 1.5GB large, vs 50-150GB on the other affected pools. The tiny pool
> was still being used, and just because that pool was blocking requests, the
> whole cluster was unresponsive.
>
> So, what do you mean by a "non-functional" pool? How can a pool become
> non-functional? And what assures me that tomorrow (just because I deleted
> the 1.5GB pool to fix the whole problem) another pool won't become
> non-functional?
>

Well, you said there were a bunch of slow requests. That can happen any
number of ways, if you're overloading the OSDs or something.
When there are slow requests, those ops take up OSD memory and throttle,
and so they don't let in new messages until the old ones are serviced. This
can cascade across a cluster -- because everything is interconnected,
clients and OSDs end up with all their requests targeted at the slow OSDs
which aren't letting in new IO quickly enough. It's one of the weaknesses
of the standard deployment patterns, but it usually doesn't come up unless
something else has gone pretty wrong first.
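
For reference, the limits involved here are the OSD client message throttles. In 
hammer-era ceph.conf terms they look roughly like the following; the values 
shown are approximate defaults and should be checked against your release:

[osd]
osd client message cap = 100             # max in-flight client messages per OSD
osd client message size cap = 524288000  # ~500MB of queued client message data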
As for what actually went wrong here, you haven't provided near enough
information and probably can't now that the pool has been deleted. *shrug*
-Greg




> Ceph Bug ?
> Another Bug ?
> Something than can be avoided ?
>
>
> On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum  wrote:
>
> Some facts:
> The OSDs use a lot of gossip protocols to distribute information.
> The OSDs limit how many client messages they let in to the system at a
> time.
> The OSDs do not distinguish between client ops for different pools (the
> blocking happens before they have any idea what the target is).
>
> So, yes: if you have a non-functional pool and clients keep trying to
> access it, those requests can fill up the OSD memory queues and block
> access to other pools as it cascades across the system.
>
> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario 
> wrote:
>
> Hi, we have a 7 node ubuntu ceph hammer cluster (78 OSDs to be exact).
> This weekend we've experienced a huge outage for our customers' vms
> (located on pool CUSTOMERS, replica size 3) when lots of OSDs
> started to slow request/block PGs on pool PRIVATE (replica size 1);
> basically all blocked PGs had just one OSD in the acting set, but
> all customers on the other pool got their vms almost frozen.
>
> While trying to do basic troubleshooting, like setting noout and then
> bringing down the OSD that slowed/blocked the most, immediately
> another OSD slowed/locked iops on pgs from the same PRIVATE pool, so
> we rolled back that change and started to move data around with the
> same logic (reweighting down those OSDs) with exactly the same result.
>
> So we made a decision: we decided to delete the pool whose PGs
> were always the ones slowed/locked, regardless of the OSD.
>
> Not even 10 seconds passed after the pool deletion before not only
> were there no more degraded PGs, but also ALL slow iops disappeared
> for good, and performance of hundreds of vms came back to normal
> immediately.
>
> I must say that I was kinda scared to see that happen, basically
> because only ONE pool's PGs were ever slowed, but performance
> hit the other pool too, so... aren't the PGs that exist in one pool
> separate from those of the other?
> If my assertion is true, why did OSDs locking iops on one pool's PGs
> slow down all the other PGs from other pools?
>
> again, i just deleted a pool that has almost no traffic, because its
> pgs were locked and affected pgs on another pool, and as soon as that
> happened, the whole cluster came back to normal (and of course,
> HEALTH_OK and no slow transaction whatsoever)
>
> please, someone help me understand the gap where i miss something,
> since this , as long as my ceph knowledge is concerned, makes no
> sense.
>
> PS: i have found someone that , looks like went through the same here:
>
> https://forum.proxmox.com/threads/ceph-osd-failure-causing-proxmox-node-to-crash.20781/
> but i still dont understand what happened.
>
> hoping to get the help from the community.
>
> --
> Alejandrito.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
> _
> www.nubeliu.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-10 Thread Jason Dillaman
librbd doesn't know that you are using libaio vs POSIX AIO. Therefore,
the best bet is that the issue is in fio or glibc. As a first step, I
would recommend using blktrace (or similar) within your VM to
determine if there is a delta between libaio and POSIX AIO at the
block level.
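
A minimal trace/parse cycle inside the guest could look like the following (the 
device name is illustrative; run it while the fio job is active):

blktrace -d /dev/vda -o posixaio -w 60   # capture 60s of block-layer events
blkparse -i posixaio -d posixaio.bin     # human-readable dump + binary for btt
btt -i posixaio.bin                      # latency breakdown (Q2C, D2C, ...)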

On Fri, Mar 10, 2017 at 12:28 PM, Xavier Trilla
 wrote:
> I disabled rbd cache but no improvement, just a huge performance drop in
> writes (Which proves the cache was properly disabled).
>
>
>
> Now I’m working on two other fronts:
>
>
>
> -Using librbd with jemalloc in the Hypervisors (Hammer .10)
>
> -Compiling QEMU with jemalloc (QEMU 2.6)
>
> -Running some tests from a Bare Metal server using FIO tool, but it
> will use the librbd directly so no way to simulate POSIX AIO (Maybe I’ll try
> via KRBD)
>
>
>
> I’m quite sure is something on the client side, but I don’t know enough
> about the Ceph internals to totally discard the issue being related to OSDs.
> But so far performance of the OSDs is really good using other test engines,
> so I’m working more on the client side.
>
>
>
> Any help or information would be really welcome J
>
>
>
> Thanks.
>
> Xavier.
>
>
>
> De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de
> Xavier Trilla
> Enviado el: viernes, 10 de marzo de 2017 14:13
> Para: Alexandre DERUMIER 
> CC: ceph-users 
> Asunto: Re: [ceph-users] Posix AIO vs libaio read performance
>
>
>
> Hi Alexandre,
>
>
>
> Debugging is disabled in client and osds.
>
>
>
> Regarding rbd cache, is something I will try -today I was thinking about it-
> but I did not try it yet because I don't want to reduce write speed.
>
>
>
> I also tried iothreads, but no benefit.
>
>
>
> I tried as well with virtio-blk and virtio-scsi, there is a small
> improvement with virtio-blk, but it's around a 10%.
>
>
>
> This is becoming a quite strange issue, as it only affects posix aio read
> performance. Nothing less seems to be affected -although posix aio write
> isn't nowhere near libaio performance-.
>
>
>
> Thanks for you help, if you have any other ideas they will be really
> appreciated.
>
>
>
> Also if somebody could run in their cluster from inside a VM the following
> command:
>
>
>
> fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio
> --buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32
>
>
>
> It would be really helpful to know if I'm the only one affected or this is
> happening in all qemu + ceph setups.
>
> Thanks!
>
> Xavier
>
>
> El 10 mar 2017, a las 8:07, Alexandre DERUMIER 
> escribió:
>
>
>
> But it still looks like there is some bottleneck in QEMU o Librbd I cannot
> manage to find.
>
>
> you can improve latency on client with disable debug.
>
> on your client, create a /etc/ceph/ceph.conf with
>
> [global]
> debug asok = 0/0
> debug auth = 0/0
> debug buffer = 0/0
> debug client = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug filer = 0/0
> debug filestore = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug journal = 0/0
> debug journaler = 0/0
> debug lockdep = 0/0
> debug mds = 0/0
> debug mds balancer = 0/0
> debug mds locker = 0/0
> debug mds log = 0/0
> debug mds log expire = 0/0
> debug mds migrator = 0/0
> debug mon = 0/0
> debug monc = 0/0
> debug ms = 0/0
> debug objclass = 0/0
> debug objectcacher = 0/0
> debug objecter = 0/0
> debug optracker = 0/0
> debug osd = 0/0
> debug paxos = 0/0
> debug perfcounter = 0/0
> debug rados = 0/0
> debug rbd = 0/0
> debug rgw = 0/0
> debug throttle = 0/0
> debug timer = 0/0
> debug tp = 0/0
>
>
> you can also disable rbd_cache=false   or in qemu set cache=none.
>
> Using iothread on qemu drive should help a little bit too.
>
> - Mail original -
> De: "Xavier Trilla" 
> À: "ceph-users" 
> Envoyé: Vendredi 10 Mars 2017 05:37:01
> Objet: Re: [ceph-users] Posix AIO vs libaio read performance
>
>
>
> Hi,
>
>
>
> We compiled Hammer .10 to use jemalloc and now the cluster performance
> improved a lot, but POSIX AIO operations are still quite slower than libaio.
>
>
>
> Now with a single thread read operations are about 1000 per second and write
> operations about 5000 per second.
>
>
>
> Using same FIO configuration, but libaio read operations are about 15K per
> second and writes 12K per second.
>
>
>
> I’m compiling QEMU with jemalloc support as well, and I’m planning to
> replace librbd in QEMU hosts to the new one using jemalloc.
>
>
>
> But it still looks like there is some bottleneck in QEMU o Librbd I cannot
> manage to find.
>
>
>
> Any help will be much appreciated.
>
>
>
> Thanks.
>
>
>
>
>
>
> De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de
> Xavier Trilla
> Enviado el: jueves, 9 de marzo de 2017 6:56
> Para: ceph-users@lists.ceph.com
> Asunto: [ceph-users] Posix AIO vs libaio read performance
>
>
>
>
> Hi,
>
>
>
> I’m trying to debut why there is a big difference using POSIX AIO and libaio
> when performing read tests from inside a VM using lib

[ceph-users] Unable to boot OS on cluster node

2017-03-10 Thread Shain Miley

Hello,

We had an issue with one of our Dell 720xd servers and now the raid card 
cannot seem to boot from the Ubuntu OS drive volume.


I would like to know...if I reload the OS...is there an easy way to get 
the 12 OSD's disks back into the cluster without just having to remove 
them from the cluster, wipe the drives and then re-add them?


Right now I have the 'noout' and 'nodown' flags set on the cluster so 
there has been no data movement yet as a result of this node being down.


Thanks in advance for any help.

Shain


--
NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org | 
202.513.3649

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-10 Thread Xavier Trilla
I disabled rbd cache but no improvement, just a huge performance drop in writes 
(Which proves the cache was properly disabled).

Now I'm working on two other fronts:


-Using librbd with jemalloc in the Hypervisors (Hammer .10)

-Compiling QEMU with jemalloc (QEMU 2.6)

-Running some tests from a bare-metal server using the FIO tool, but it 
will use librbd directly so there is no way to simulate POSIX AIO (maybe I'll 
try via KRBD; a sketch of such a run follows this list)
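
A bare-metal librbd run with fio's rbd engine might look roughly like this 
(pool, image and client names are assumptions, the image has to exist first, and 
fio has to be built with rbd support):

rbd create fio-test --size 1024 --pool rbd
fio --name=randread-rbd --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fio-test --rw=randread --bs=4k --iodepth=32 --runtime=60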

I'm quite sure it's something on the client side, but I don't know enough about 
the Ceph internals to totally rule out the issue being related to the OSDs. But 
so far performance of the OSDs is really good using other test engines, so I'm 
working more on the client side.

Any help or information would be really welcome :)

Thanks.
Xavier.

De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de Xavier 
Trilla
Enviado el: viernes, 10 de marzo de 2017 14:13
Para: Alexandre DERUMIER 
CC: ceph-users 
Asunto: Re: [ceph-users] Posix AIO vs libaio read performance

Hi Alexandre,

Debugging is disabled in client and osds.

Regarding rbd cache, is something I will try -today I was thinking about it- 
but I did not try it yet because I don't want to reduce write speed.

I also tried iothreads, but no benefit.

I tried as well with virtio-blk and virtio-scsi, there is a small improvement 
with virtio-blk, but it's around a 10%.

This is becoming a quite strange issue, as it only affects posix aio read 
performance. Nothing less seems to be affected -although posix aio write isn't 
nowhere near libaio performance-.

Thanks for you help, if you have any other ideas they will be really 
appreciated.

Also if somebody could run in their cluster from inside a VM the following 
command:

fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32

It would be really helpful to know if I'm the only one affected or this is 
happening in all qemu + ceph setups.
Thanks!
Xavier

El 10 mar 2017, a las 8:07, Alexandre DERUMIER 
mailto:aderum...@odiso.com>> escribió:


But it still looks like there is some bottleneck in QEMU o Librbd I cannot 
manage to find.

you can improve latency on client with disable debug.

on your client, create a /etc/ceph/ceph.conf with

[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0


you can also disable rbd_cache=false   or in qemu set cache=none.

Using iothread on qemu drive should help a little bit too.

- Mail original -
De: "Xavier Trilla" 
mailto:xavier.tri...@silicontower.net>>
À: "ceph-users" mailto:ceph-users@lists.ceph.com>>
Envoyé: Vendredi 10 Mars 2017 05:37:01
Objet: Re: [ceph-users] Posix AIO vs libaio read performance



Hi,



We compiled Hammer .10 to use jemalloc and now the cluster performance improved 
a lot, but POSIX AIO operations are still quite slower than libaio.



Now with a single thread read operations are about 1000 per second and write 
operations about 5000 per second.



Using same FIO configuration, but libaio read operations are about 15K per 
second and writes 12K per second.



I'm compiling QEMU with jemalloc support as well, and I'm planning to replace 
librbd in QEMU hosts to the new one using jemalloc.



But it still looks like there is some bottleneck in QEMU o Librbd I cannot 
manage to find.



Any help will be much appreciated.



Thanks.






De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de Xavier 
Trilla
Enviado el: jueves, 9 de marzo de 2017 6:56
Para: ceph-users@lists.ceph.com
Asunto: [ceph-users] Posix AIO vs libaio read performance




Hi,



I'm trying to debut why there is a big difference using POSIX AIO and libaio 
when performing read tests from inside a VM using librbd.



The results I'm getting using FIO are:



POSIX AIO Read:



Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:



Average: 2.54 MB/s

Average: 632 IOPS



Libaio Read:



Type: Random Read - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:



Average: 147.88 MB/s

Average: 36967 IOPS



When performing writes the differences aren't so big, because the cluster 
-which is in production right now- is CPU bonded:



POSIX AIO Write:



Type

Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-10 Thread Xavier Trilla
Hi Alexandre,

Debugging is disabled in client and osds.

Regarding rbd cache, is something I will try -today I was thinking about it- 
but I did not try it yet because I don't want to reduce write speed.

I also tried iothreads, but no benefit.

I tried as well with virtio-blk and virtio-scsi; there is a small improvement 
with virtio-blk, but it's only around 10%.

This is becoming quite a strange issue, as it only affects POSIX AIO read 
performance. Nothing else seems to be affected -although POSIX AIO write isn't 
anywhere near libaio performance either-.

Thanks for your help; if you have any other ideas they will be really 
appreciated.

Also if somebody could run in their cluster from inside a VM the following 
command:

fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32

It would be really helpful to know if I'm the only one affected or this is 
happening in all qemu + ceph setups.

Thanks!
Xavier

El 10 mar 2017, a las 8:07, Alexandre DERUMIER 
mailto:aderum...@odiso.com>> escribió:


But it still looks like there is some bottleneck in QEMU o Librbd I cannot 
manage to find.

you can improve latency on client with disable debug.

on your client, create a /etc/ceph/ceph.conf with

[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0


you can also disable rbd_cache=false   or in qemu set cache=none.

Using iothread on qemu drive should help a little bit too.

- Mail original -
De: "Xavier Trilla" 
mailto:xavier.tri...@silicontower.net>>
À: "ceph-users" mailto:ceph-users@lists.ceph.com>>
Envoyé: Vendredi 10 Mars 2017 05:37:01
Objet: Re: [ceph-users] Posix AIO vs libaio read performance



Hi,



We compiled Hammer .10 to use jemalloc and now the cluster performance improved 
a lot, but POSIX AIO operations are still quite slower than libaio.



Now with a single thread read operations are about 1000 per second and write 
operations about 5000 per second.



Using same FIO configuration, but libaio read operations are about 15K per 
second and writes 12K per second.



I’m compiling QEMU with jemalloc support as well, and I’m planning to replace 
librbd in QEMU hosts to the new one using jemalloc.



But it still looks like there is some bottleneck in QEMU o Librbd I cannot 
manage to find.



Any help will be much appreciated.



Thanks.






De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de Xavier 
Trilla
Enviado el: jueves, 9 de marzo de 2017 6:56
Para: ceph-users@lists.ceph.com
Asunto: [ceph-users] Posix AIO vs libaio read performance




Hi,



I’m trying to debut why there is a big difference using POSIX AIO and libaio 
when performing read tests from inside a VM using librbd.



The results I’m getting using FIO are:



POSIX AIO Read:



Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:



Average: 2.54 MB/s

Average: 632 IOPS



Libaio Read:



Type: Random Read - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:



Average: 147.88 MB/s

Average: 36967 IOPS



When performing writes the differences aren’t so big, because the cluster 
–which is in production right now- is CPU bonded:



POSIX AIO Write:



Type: Random Write - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:



Average: 14.87 MB/s

Average: 3713 IOPS



Libaio Write:



Type: Random Write - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /:



Average: 14.51 MB/s

Average: 3622 IOPS





Even if the write results are CPU bonded, as the machines containing the OSDs 
don’t have enough CPU to handle all the IOPS (CPU upgrades are on its way) I 
cannot really understand why I’m seeing so much difference in the read tests.



Some configuration background:



- Cluster and clients are using Hammer 0.94.90

- It’s a full SSD cluster running over Samsung Enterprise SATA SSDs, with all 
the typical tweaks (Customized ceph.conf, optimized sysctl, etc…)

- Tried QEMU 2.0 and 2.7 – Similar results

- Tried virtio-blk and virtio-scsi – Similar results



I’ve been reading about POSIX AIO and Libaio, and I can see there are several 
differences on how they work (Like one being user space and the 

Re: [ceph-users] Jewel v10.2.6 released

2017-03-10 Thread Götz Reinicke - IT Koordinator

Hi,

Am 08.03.17 um 13:11 schrieb Abhishek L:

This point release fixes several important bugs in RBD mirroring, RGW 
multi-site, CephFS, and RADOS.

We recommend that all v10.2.x users upgrade.

For more detailed information, see the complete changelog[1] and the release 
notes[2]

I hope you can give me some quick advice on updating a running ceph cluster 
from 10.2.5 -> 10.2.6. Can I just do a yum update and I'm good?


Thanks for making that clear. Regards, Götz




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Laszlo Budai

The OSDs are all there.

$ sudo ceph osd stat
 osdmap e60609: 72 osds: 72 up, 72 in

and I have attached the results of the ceph osd tree and ceph osd dump commands.
I got some extra info about the network problem: a faulty network device flooded 
the network, eating up all the bandwidth, so the OSDs were not able to 
communicate properly with each other. This lasted for almost 1 day.

Thank you,
Laszlo


On 10.03.2017 12:19, Brad Hubbard wrote:

To me it looks like someone may have done an "rm" on these OSDs but
not removed them from the crushmap. This does not happen
automatically.

Do these OSDs show up in "ceph osd tree" and "ceph osd dump" ? If so,
paste the output.

Without knowing what exactly happened here it may be difficult to work
out how to proceed.

In order to go clean the primary needs to communicate with multiple
OSDs, some of which are marked DNE and seem to be uncontactable.

This seems to be more than a network issue (unless the outage is still
happening).

http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete



On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai  wrote:

Hello,

I was informed that due to a networking issue the ceph cluster network was
affected. There was a huge packet loss, and network interfaces were
flipping. That's all I got.
This outage has lasted a longer period of time. So I assume that some OSD
may have been considered dead and the data from them has been moved away to
other PGs (this is what ceph is supposed to do if I'm correct). Probably
that was the point when the listed PGs have appeared into the picture.
From the query we can see this for one of those OSDs:
{
"peer": "14",
"pgid": "3.367",
"last_update": "0'0",
"last_complete": "0'0",
"log_tail": "0'0",
"last_user_version": 0,
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": {
"epoch_created": 4,
"last_epoch_started": 54899,
"last_epoch_clean": 55143,
"last_epoch_split": 0,
"same_up_since": 60603,
"same_interval_since": 60603,
"same_primary_since": 60593,
"last_scrub": "2852'33528",
"last_scrub_stamp": "2017-02-26 02:36:55.210150",
"last_deep_scrub": "2852'16480",
"last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
"last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
},
"stats": {
"version": "0'0",
"reported_seq": "14",
"reported_epoch": "59779",
"state": "down+peering",
"last_fresh": "2017-02-27 16:30:16.230519",
"last_change": "2017-02-27 16:30:15.267995",
"last_active": "0.00",
"last_peered": "0.00",
"last_clean": "0.00",
"last_became_active": "0.00",
"last_became_peered": "0.00",
"last_unstale": "2017-02-27 16:30:16.230519",
"last_undegraded": "2017-02-27 16:30:16.230519",
"last_fullsized": "2017-02-27 16:30:16.230519",
"mapping_epoch": 60601,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 4,
"last_epoch_clean": 55143,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "2852'33528",
"last_scrub_stamp": "2017-02-26 02:36:55.210150",
"last_deep_scrub": "2852'16480",
"last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
"last_clean_scrub_stamp": "2017-02-26 02:36:55.210150",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": "0",
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_

Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-10 Thread Alejandro Comisario
Any thoughts ?

On Tue, Mar 7, 2017 at 3:17 PM, Alejandro Comisario 
wrote:

> Gregory, thanks for the response, what you've said is by far, the most
> enlightneen thing i know about ceph in a long time.
>
> What brings even greater doubt, which is, this "non-functional" pool, was
> only 1.5GB large, vs 50-150GB on the other effected pools, the tiny pool
> was still being used, and just because that pool was blovking requests, the
> whole cluster was unresponsive.
>
> So , what do you mean by "non-functional" pool ? how a pool can become
> non-functional ? and what asures me that tomorrow (just becaue i deleted
> the 1.5GB pool to fix the whole problem) another pool doesnt becomes
> non-functional ?
>
> Ceph Bug ?
> Another Bug ?
> Something than can be avoided ?
>
>
> On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum  wrote:
>
>> Some facts:
>> The OSDs use a lot of gossip protocols to distribute information.
>> The OSDs limit how many client messages they let in to the system at a
>> time.
>> The OSDs do not distinguish between client ops for different pools (the
>> blocking happens before they have any idea what the target is).
>>
>> So, yes: if you have a non-functional pool and clients keep trying to
>> access it, those requests can fill up the OSD memory queues and block
>> access to other pools as it cascades across the system.
>>
>> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario 
>> wrote:
>>
>>> Hi, we have a 7 nodes ubuntu ceph hammer pool (78 OSD to be exact).
>>> This weekend we'be experienced a huge outage from our customers vms
>>> (located on pool CUSTOMERS, replica size 3 ) when lots of OSD's
>>> started to slow request/block PG's on pool PRIVATE ( replica size 1 )
>>> basically all PG's blocked where just one OSD in the acting set, but
>>> all customers on the other pool got their vms almost freezed.
>>>
>>> while trying to do basic troubleshooting like doing noout and then
>>> bringing down the OSD that slowed/blocked the most, inmediatelly
>>> another OSD slowed/locked iops on pgs from the same PRIVATE pool, so
>>> we rolled back that change and started to move data around with the
>>> same logic (reweighting down those OSD) with exactly the same result.
>>>
>>> So, me made a decition, we decided to delete the pool where all PGS
>>> where slowed/locked allways despite the osd.
>>>
>>> Not even 10 secconds passes after the pool deletion, where not only
>>> there were no more degraded PGs, bit also ALL slow iops dissapeared
>>> for ever, and performance from hundreds of vms came to normal
>>> immediately.
>>>
>>> I must say that i was kinda scared to see that happen, bascally
>>> because there was only ONE POOL's PGS always slowed, but performance
>>> hit the another pool, so ... did not the PGS that exists on one pool
>>> are not shared by the other ?
>>> If my assertion is true, why OSD's locking iops from one pool's pg
>>> slowed down all other pgs from other pools ?
>>>
>>> again, i just deleted a pool that has almost no traffic, because its
>>> pgs were locked and affected pgs on another pool, and as soon as that
>>> happened, the whole cluster came back to normal (and of course,
>>> HEALTH_OK and no slow transaction whatsoever)
>>>
>>> Please, someone help me understand what I'm missing here, since this,
>>> as far as my Ceph knowledge goes, makes no sense.
>>>
>>> PS: I found someone who, it seems, went through the same thing here:
>>> https://forum.proxmox.com/threads/ceph-osd-failure-causing-proxmox-node-to-crash.20781/
>>> but I still don't understand what happened.
>>>
>>> Hoping to get help from the community.
>>>
>>> --
>>> Alejandrito.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
> --
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
> _
> www.nubeliu.com
>



-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS PG calculation

2017-03-10 Thread John Spray
On Fri, Mar 10, 2017 at 9:11 AM, Eneko Lacunza  wrote:
> Hi Martin,
>
> Take a look at
> http://ceph.com/pgcalc/

As a rough guide, use the "RBD" example to work out how many PGs your
CephFS data pool should have.

The metadata pool can almost certainly have far fewer, maybe even like
10x fewer -- would be interested to hear anyone's practical
experiences with that though.
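
For Martin's setup (4 nodes x 10 OSDs = 40 OSDs), a rough sketch of the usual
pgcalc arithmetic, assuming replica size 3 and the common target of about 100
PGs per OSD:

(40 OSDs x 100 PGs per OSD) / 3 replicas = ~1333
rounded to a power of two: 1024 (or 2048 if you expect the cluster to grow)

So the data pool would get roughly 1024 PGs, and following the "10x fewer"
idea above the metadata pool could start at something like 64 or 128.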

John

>
> Cheers
> Eneko
>
>> On 10/03/17 at 09:54, Martin Wittwer wrote:
>>
>> Hi List
>>
>> I am creating a POC cluster with CephFS as a backend for our backup
>> infrastructure. The backups are rsyncs of whole servers.
>> I have 4 OSD nodes with 10 4TB disks and 2 SSDs for journaling per node.
>>
>> My question is now how to calculate the PG count for that scenario? Is
>> there a way to calculate how many PGs the data/metadata pool needs or
>> are there any recommendations?
>>
>> Best
>>
>
>
> --
> Zuzendari Teknikoa / Director Técnico
> Binovo IT Human Project, S.L.
> Telf. 943493611
>   943324914
> Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
> www.binovo.es
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Brad Hubbard
To me it looks like someone may have done an "rm" on these OSDs but
not removed them from the crushmap. This does not happen
automatically.

Do these OSDs show up in "ceph osd tree" and "ceph osd dump" ? If so,
paste the output.
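
For reference, those checks, plus the usual cleanup sequence if an OSD really
was removed from the host but never removed from the cluster maps, look
roughly like this (osd.14 is only an example id, don't run the removal part
until you are sure the OSD is really gone):

ceph osd tree
ceph osd dump | grep osd
ceph osd crush remove osd.14
ceph auth del osd.14
ceph osd rm 14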

Without knowing what exactly happened here it may be difficult to work
out how to proceed.

In order to go clean, the primary needs to communicate with multiple
OSDs, some of which are marked DNE and seem to be uncontactable.

This seems to be more than a network issue (unless the outage is still
happening).

http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete



On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai  wrote:
> Hello,
>
> I was informed that due to a networking issue the ceph cluster network was
> affected. There was huge packet loss, and network interfaces were
> flipping. That's all I got.
> This outage lasted quite a long period of time. So I assume that some OSDs
> may have been considered dead and the data from them has been moved away to
> other OSDs (this is what Ceph is supposed to do, if I'm correct). Probably
> that was the point when the listed PGs appeared in the picture.
> From the query we can see this for one of those OSDs:
> {
> "peer": "14",
> "pgid": "3.367",
> "last_update": "0'0",
> "last_complete": "0'0",
> "log_tail": "0'0",
> "last_user_version": 0,
> "last_backfill": "MAX",
> "purged_snaps": "[]",
> "history": {
> "epoch_created": 4,
> "last_epoch_started": 54899,
> "last_epoch_clean": 55143,
> "last_epoch_split": 0,
> "same_up_since": 60603,
> "same_interval_since": 60603,
> "same_primary_since": 60593,
> "last_scrub": "2852'33528",
> "last_scrub_stamp": "2017-02-26 02:36:55.210150",
> "last_deep_scrub": "2852'16480",
> "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
> "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
> },
> "stats": {
> "version": "0'0",
> "reported_seq": "14",
> "reported_epoch": "59779",
> "state": "down+peering",
> "last_fresh": "2017-02-27 16:30:16.230519",
> "last_change": "2017-02-27 16:30:15.267995",
> "last_active": "0.00",
> "last_peered": "0.00",
> "last_clean": "0.00",
> "last_became_active": "0.00",
> "last_became_peered": "0.00",
> "last_unstale": "2017-02-27 16:30:16.230519",
> "last_undegraded": "2017-02-27 16:30:16.230519",
> "last_fullsized": "2017-02-27 16:30:16.230519",
> "mapping_epoch": 60601,
> "log_start": "0'0",
> "ondisk_log_start": "0'0",
> "created": 4,
> "last_epoch_clean": 55143,
> "parent": "0.0",
> "parent_split_bits": 0,
> "last_scrub": "2852'33528",
> "last_scrub_stamp": "2017-02-26 02:36:55.210150",
> "last_deep_scrub": "2852'16480",
> "last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
> "last_clean_scrub_stamp": "2017-02-26 02:36:55.210150",
> "log_size": 0,
> "ondisk_log_size": 0,
> "stats_invalid": "0",
> "stat_sum": {
> "num_bytes": 0,
> "num_objects": 0,
> "num_object_clones": 0,
> "num_object_copies": 0,
> "num_objects_missing_on_primary": 0,
> "num_objects_degraded": 0,
> "num_objects_misplaced": 0,
> "num_objects_unfound": 0,
> "num_objects_dirty": 0,
> "num_whiteouts": 0,
> "num_read": 0,
> "num_read_kb": 0,
> "num_write": 0,
> "num_write_kb": 0,
> "num_scrub_errors": 0,
> "num_shallow_scrub_errors": 0,
> "num_deep_scrub_errors": 0,
> "num_objects_recovered": 0,
> "num_bytes_recovered": 0,
> "num_keys_recovered": 0,
> "num_objects_omap": 0,
> "num_objects_hit_set_archive": 0,
> "num_bytes_hit_set_archive": 0
> },
> "up": [
> 28,
> 35,
> 2
> ],
> "acting": [
> 28,
> 35,
> 2
> ]

Re: [ceph-users] CephFS PG calculation

2017-03-10 Thread Eneko Lacunza

Hi Martin,

Take a look at
http://ceph.com/pgcalc/

Cheers
Eneko

On 10/03/17 at 09:54, Martin Wittwer wrote:

Hi List

I am creating a POC cluster with CephFS as a backend for our backup
infrastructure. The backups are rsyncs of whole servers.
I have 4 OSD nodes with 10 4TB disks and 2 SSDs for journaling per node.

My question is now how to calculate the PG count for that scenario? Is
there a way to calculate how many PGs the data/metadata pool needs or
are there any recommendations?

Best




--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611
  943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS PG calculation

2017-03-10 Thread Martin Wittwer
Hi List

I am creating a POC cluster with CephFS as a backend for our backup
infrastructure. The backups are rsyncs of whole servers.
I have 4 OSD nodes with 10 4TB disks and 2 SSDs for journaling per node.

My question is now how to calculate the PG count for that scenario? Is
there a way to calculate how many PGs the data/metadata pool needs or
are there any recommendations?

Best

-- 
---
Martin Wittwer

DATONUS Switzerland
Heimensteinstrasse 8D, CH-8472 Seuzach
Telefon +41 52 511 87 80, Direkt +41 52 511 87 82
Support +41 52 511 87 87 (charges apply)
martin.witt...@datonus.ch, www.datonus.ch
---


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Laszlo Budai

Hello,

I was informed that due to a networking issue the ceph cluster network was
affected. There was huge packet loss, and network interfaces were flipping.
That's all I got.
This outage lasted quite a long period of time. So I assume that some OSDs may
have been considered dead and the data from them has been moved away to other
OSDs (this is what Ceph is supposed to do, if I'm correct). Probably that was
the point when the listed PGs appeared in the picture.
From the query we can see this for one of those OSDs:
{
"peer": "14",
"pgid": "3.367",
"last_update": "0'0",
"last_complete": "0'0",
"log_tail": "0'0",
"last_user_version": 0,
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": {
"epoch_created": 4,
"last_epoch_started": 54899,
"last_epoch_clean": 55143,
"last_epoch_split": 0,
"same_up_since": 60603,
"same_interval_since": 60603,
"same_primary_since": 60593,
"last_scrub": "2852'33528",
"last_scrub_stamp": "2017-02-26 02:36:55.210150",
"last_deep_scrub": "2852'16480",
"last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
"last_clean_scrub_stamp": "2017-02-26 02:36:55.210150"
},
"stats": {
"version": "0'0",
"reported_seq": "14",
"reported_epoch": "59779",
"state": "down+peering",
"last_fresh": "2017-02-27 16:30:16.230519",
"last_change": "2017-02-27 16:30:15.267995",
"last_active": "0.00",
"last_peered": "0.00",
"last_clean": "0.00",
"last_became_active": "0.00",
"last_became_peered": "0.00",
"last_unstale": "2017-02-27 16:30:16.230519",
"last_undegraded": "2017-02-27 16:30:16.230519",
"last_fullsized": "2017-02-27 16:30:16.230519",
"mapping_epoch": 60601,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 4,
"last_epoch_clean": 55143,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "2852'33528",
"last_scrub_stamp": "2017-02-26 02:36:55.210150",
"last_deep_scrub": "2852'16480",
"last_deep_scrub_stamp": "2017-02-21 00:14:08.866448",
"last_clean_scrub_stamp": "2017-02-26 02:36:55.210150",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": "0",
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0
},
"up": [
28,
35,
2
],
"acting": [
28,
35,
2
],
"blocked_by": [],
"up_primary": 28,
"acting_primary": 28
},
"empty": 1,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 0,
"hit_set_history": {
"current_last_update": "0'0",
"current_last_stamp": "0.00",
"current_info": {
"begin": "0.00",
"end": "0.00",
"version": "0'0",
"using_gmt": "1"
},
"history": []
}
},

Where can I read more about the meaning of each parameter? Some of them have
quite self-explanatory names, but not all (or probably we need deeper
knowledge to understand them).
Isn't there any parameter that would say when was that OSD 
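
For anyone following along, output like the block above comes from querying
the PG directly, roughly:

ceph pg 3.367 query
ceph pg dump_stuck inactive

As far as I know the individual fields are only really documented in the
source and in the pg-states page linked earlier in the thread.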

Re: [ceph-users] [Jewel] upgrade 10.2.3 => 10.2.5 KO : first OSD server freeze every two days :)

2017-03-10 Thread pascal.pu...@pci-conseil.net

Hello,

Last night, same effect, a new freeze... BUT this morning I maybe found
out why!


A stupid boy added "vm.vfs_cache_pressure=1" for tuning and forgot to
remove it afterwards on the first OSD node... bad boy :)


There is always an explanation. It could not be otherwise.

This was maybe fine before the upgrade, but not after... That explains a
lot, like why the load on that node is always a little higher than on the
others... etc.
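
For the record, a minimal sketch of putting that back, assuming the kernel
default of 100 is what you want and that the setting lives in /etc/sysctl.conf
or /etc/sysctl.d/:

sysctl vm.vfs_cache_pressure          # check the current value
sysctl -w vm.vfs_cache_pressure=100   # revert to the default at runtime
# then remove the vm.vfs_cache_pressure line from the sysctl config
# so it does not come back after a reboot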


We will see in two days.

Sorry, sorry, sorry :|


On 09/03/2017 at 13:45, pascal.pu...@pci-conseil.net wrote:

On 09/03/2017 at 13:03, Vincent Godin wrote:
First of all, don't do a Ceph upgrade while your cluster is in a
warning or error state. An upgrade must be done from a clean
cluster.

Of course.
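
For the record, the pre-upgrade sanity check is simply:

ceph -s
ceph health detail

and only proceeding once the cluster reports HEALTH_OK.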

So, yesterday, I tried this for my "unfound PG":

ceph pg 50.2dd mark_unfound_lost revert => MON crash :(
so:
ceph pg 50.2dd mark_unfound_lost delete => OK.

The cluster was HEALTH_OK => so I finally migrated everything to Jewel
10.2.6. And this night, nothing: everything worked fine (trimfs of rbd was
disabled).

Maybe next time. It's always after two days. (Scrubbing runs from 22h to 6h.)

Don't stay with a replica count of 2. The majority of problems come from
that point: just look at the advice given by experienced users on the list.
You should set a replica count of 3 and a min_size of 2. This will
prevent you from losing data because of a double fault, which is
frequent.
I already had some faulty PGs found by the scrubbing process (disk I/O
error) and had to remove the bad PG myself. As I understood it, with 3
replicas, repair would be automatic.

Ok, I will change to 3. :)
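
For reference, per pool that change would be something like the following
(the pool name is a placeholder, and note that going from 2 to 3 replicas
triggers a round of backfill):

ceph osd pool set <pool-name> size 3
ceph osd pool set <pool-name> min_size 2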
For your specific problem, I have no idea of the root cause. If you
have already checked your network (tuning parameters, jumbo frames
enabled, etc.), the software versions on all the components, and your
hardware (RAID card, system messages, ...), maybe you should just
re-install your first OSD server. I had a big problem after an upgrade
from Hammer to Jewel and nobody else seems to have encountered it doing
the same operation. All servers were configured the same way, but they
did not have the same history. We found that the problem came from the
different versions we had installed on some OSD servers (Giant -> Hammer
-> Jewel). OSD servers which had never run the Giant version had no
problem at all. On the problematic servers (in Jewel) we hit bugs which
had been fixed years ago in Giant!!! So we had to isolate those servers
and reinstall them directly in Jewel: that solved the problem.


OK. I will think about it.

But all the nodes are really the same => checked all nodes with rpm -Va => OK.
Tuning checked, etc... network checked, OK... It started just the day after
the upgrade :)


Thanks for your advice. We will see tonight. :)

Pascal.






--
*Performance Conseil Informatique*
Pascal Pucci
Consultant Infrastructure
pascal.pu...@pci-conseil.net 
Mobile: 06 51 47 84 98
Office: 02 85 52 41 81
http://www.performance-conseil-informatique.net
News: We're transforming! As promised, in 2017 we're transforming! At your
side, we transform your IT infrastructure while keeping the PCI
fundamentals: Conti...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com