On 04/22/15 17:57, Jeff Epstein wrote:
>
>
> On 04/10/2015 10:10 AM, Lionel Bouton wrote:
>> On 04/10/15 15:41, Jeff Epstein wrote:
>>> [...]
>>> This seems highly unlikely. We get very good performance without
>>> ceph. Requisitioning and manipulating block devices through LVM
>>> happens instantaneously. We expect ceph to be a bit slower due to
>>> its distributed nature, but we've seen operations block for up to an
>>> hour, which is clearly beyond the pale. Furthermore, as the
>>> performance measures I posted show, read/write speed is not the
>>> bottleneck: ceph is simply /waiting/.
>>>
>>> So, does anyone else have any ideas why mkfs (and other operations)
>>> takes so long?
>>
>>
>> As your use case is pretty unique and clearly not something Ceph was
>> optimized for, if I were you I'd switch to a single pool with the
>> appropriate number of pgs based on your pool size (replication) and
>> the number of OSDs you use (you should target ~100 pgs/OSD, which
>> seems to be the sweet spot), and create/delete rbds instead of whole
>> pools. You would be in "known territory" and any remaining
>> performance problem would be easier to debug.
>>
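For reference, the "100 pgs/OSD" rule of thumb above boils down to
something like the sketch below (the round-up to a power of two and the
3x replication in the example are my assumptions, plug in your actual
pool size and OSD count):

# ~100 PGs per OSD in the cluster, divided by the pool's replication
# size; rounding up to a power of two is the usual final step.
def suggested_pg_num(num_osds, replication_size, target_pgs_per_osd=100):
    raw = num_osds * target_pgs_per_osd / float(replication_size)
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

# Example: 6 OSDs with 3x replication -> raw value 200, rounded up to 256.
print(suggested_pg_num(6, 3))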
> I agree that this is a good suggestion. It took me a little while, but
> I've changed the configuration so that we now have only one pool,
> containing many rbds, and now all data is spread across all six of our
> OSD nodes. However, the performance has not perceptibly improved. We
> still have the occasional long (>10 minutes) wait periods during write
> operations, and the bottleneck still seems to be ceph, rather than the
> hardware: the blocking process (usually, but not always, mkfs) is
> stuck in a wait state ("D" in ps) but no I/O is actually being
> performed, so one can surmise that the physical limitations of the
> disk medium are not the bottleneck. This is similar to what is being
> reported in the thread titled "100% IO Wait with CEPH RBD and RSYNC".
>
> Do you have some idea how I can diagnose this problem?

I'd look at the ceph -s output while these processes are stuck to see if
there's any unusual activity (scrub/deep scrub/recovery/backfills/...).
Is it correlated in any way with rbd removal (i.e., does the write
blocking only appear when at least one rbd was removed within, say, the
hour before the slowdown)?
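
Below is a minimal sketch of the kind of polling I have in mind,
assuming the blocked process is mkfs and that a 10 second poll interval
is acceptable (both are assumptions, adjust as needed). It just
snapshots ceph -s whenever ps reports a watched process in the "D"
state, so you can later correlate the blockage with scrub / recovery /
backfill activity:

import subprocess
import time
import datetime

WATCHED = "mkfs"     # name of the process reported as blocked (assumption)
INTERVAL = 10        # seconds between polls (arbitrary choice)

def blocked_pids():
    # Return PIDs of WATCHED processes currently in "D" (uninterruptible) state.
    out = subprocess.check_output(["ps", "-eo", "pid,stat,comm"]).decode()
    pids = []
    for line in out.splitlines()[1:]:
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D") and WATCHED in fields[2]:
            pids.append(fields[0])
    return pids

while True:
    pids = blocked_pids()
    if pids:
        stamp = datetime.datetime.now().isoformat()
        status = subprocess.check_output(["ceph", "-s"]).decode()
        print("%s blocked pids: %s\n%s" % (stamp, ", ".join(pids), status))
    time.sleep(INTERVAL)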

Best regards,

Lionel Bouton