Thanks all for helping to clarify this matter :)

On 09/04/14 17:03, Christian Balzer wrote:
> Hello,
>
> On Wed, 9 Apr 2014 07:31:53 -0700 Gregory Farnum wrote:
>
>> journal_max_write_bytes: the maximum amount of data the journal will
>> try to write at once when it's coalescing multiple pending ops in the
>> journal queue.
>> journal_queue_max_bytes: the maximum amount of data allowed to be
>> queued for journal writing.
>>
>> In particular, both of those are about how much is waiting to get into
>> the durable journal, not waiting to get flushed out of it.
> Thanks a bundle for that clarification, Greg.
>
> So the tunable to play with when trying to push the backing storage to its
> throughput limits would be "filestore min sync interval" then?
>
> Or can something else cause the journal to be flushed long before it
> becomes full?
This, because it's what I see: the OSDs writing at 1-3MB/s with 300 w/s,
at 100% util, which makes me want to optimize the journal further.
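
For reference, I'm watching this with plain iostat, roughly:

    iostat -xm 1 /dev/sd[b-g]

and reading off the w/s, wMB/s and %util columns (the device list is
just my setup, obviously).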

Even if I crank the journal_queue settings higher, it seems to stay that way.
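
For the record, "cranking them higher" means something like this in
ceph.conf (the value is just one of the ones I tried, not a
recommendation):

    [osd]
    journal queue max ops = 6553600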

My idea of the journal making everything sequential was that the data
would merge inside the journal and go out to the disk as nice
sequential I/O.

It could also be that the journal didn't manage to merge the ops
because they were spread out too much: as the objects are 4MB, the
4KB writes may be spread out over different objects.
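
To put a rough number on that (assuming a 100GB test image and the
default 4MB objects, so purely back-of-the-envelope): 100GB / 4MB =
25600 objects, meaning 256 parallel 4KB writes at random offsets will
almost always land on 256 different objects, leaving the journal
nothing adjacent to merge.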

Cheers,
Josef
> Christian
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Wed, Apr 9, 2014 at 3:06 AM, Christian Balzer <ch...@gol.com> wrote:
>>> On Tue, 8 Apr 2014 09:35:19 -0700 Gregory Farnum wrote:
>>>
>>>> On Tuesday, April 8, 2014, Christian Balzer <ch...@gol.com> wrote:
>>>>
>>>>> On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
>>>>>> On 08/04/14 10:39, Christian Balzer wrote:
>>>>>>> On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
>>>>>>>
>>>>>>>> On 08/04/14 10:04, Christian Balzer wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I am currently benchmarking a standard setup with Intel DC
>>>>>>>>>> S3700 disks as journals and Hitachi 4TB disks as
>>>>>>>>>> data drives, all on an LACP 10GbE network.
>>>>>>>>>>
>>>>>>>>> Unless that is the 400GB version of the DC S3700, you're already
>>>>>>>>> limiting yourself to 365MB/s throughput with the 200GB
>>>>>>>>> variant. That matters if sequential write speed is important
>>>>>>>>> to you and you think you'll ever get those 5 HDDs to write at
>>>>>>>>> full speed with Ceph (unlikely).
>>>>>>>> It's the 400GB version of the DC S3700, and yes, I'm aware that I
>>>>>>>> need a 1:3 ratio to max out these disks, as the HDDs write
>>>>>>>> sequential data at about 150MB/s.
>>>>>>>> Our thinking is that a 1:5 ratio would cover the current demand,
>>>>>>>> and that we could upgrade later.
>>>>>>> I'd reckon you'll do fine, as in run out of steam and IOPS
>>>>>>> before hitting that limit.
>>>>>>>
>>>>>>>>>> The size of my journals are 25GB each, and I have two
>>>>>>>>>> journals per machine, with 5 OSDs per journal, with 5
>>>>>>>>>> machines in total. We currently use the tunables optimal and
>>>>>>>>>> the version of ceph is the latest dumpling.
>>>>>>>>>>
>>>>>>>>>> Benchmarking writes with rbd shows that there's no problem
>>>>>>>>>> hitting the upper limits of the 4TB disks with sequential
>>>>>>>>>> data, thus maxing out 10GbE. At this point we see full
>>>>>>>>>> utilization on the journals.
>>>>>>>>>>
>>>>>>>>>> But lowering the block size to 4k shows that the journals are
>>>>>>>>>> utilized at about 20%, and the 4TB disks at 100%.
>>>>>>>>>> (rados -p <pool> bench 100 write -b 4096 -t 256)
>>>>>>>>>>
>>>>>>>>> When you're saying utilization I assume you're talking about
>>>>>>>>> iostat or atop output?
>>>>>>>> Yes, the utilization is iostat.
>>>>>>>>> That's not a bug, that's comparing apples to oranges.
>>>>>>>> You mean comparing iostat-results with the ones from rados
>>>>>>>> benchmark?
>>>>>>>>> The rados bench default is 4MB, which not only happens to be
>>>>>>>>> the default RBD object size but also generates a nice amount
>>>>>>>>> of bandwidth.
>>>>>>>>>
>>>>>>>>> While at 4k writes your SSD is obviously bored, the actual OSD
>>>>>>>>> needs to handle all those writes and becomes limited by the
>>>>>>>>> IOPS it can perform.
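>>>>>>>>> (Ballpark figures: a 7200RPM SATA drive manages on the order
>>>>>>>>> of 100-150 random write IOPS, versus tens of thousands for
>>>>>>>>> the DC S3700.)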
>>>>>>>> Yes, it's quite bored and just shuffles data.
>>>>>>>> Maybe I've been thinking about this the wrong way,
>>>>>>>> but shouldn't the journal keep buffering until the journal
>>>>>>>> partition is full or the flush interval is reached?
>>>>>>>>
>>>>>>> Take a look at "journal queue max ops", which has a default of a
>>>>>>> mere 500, so that's full after 2 seconds. ^o^
>>>>>> Hm, that makes sense.
>>>>>>
>>>>>> So, I tested both a low value (5000) and a large value
>>>>>> (6553600), but it didn't seem to change anything.
>>>>>> Any chance I could dump the current values from a running OSD, to
>>>>>> actually see what is in use?
>>>>>>
>>>>> The value can be checked like this (for example):
>>>>> ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
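>>>>> or, to narrow it down to just the journal settings, for example:
>>>>> ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep journal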
>>>>>
>>>>> If you restarted your OSD after updating ceph.conf, I'm sure you'll
>>>>> find the values you set there.
>>>>>
>>>>> However, you are seriously underestimating the packet storm you're
>>>>> unleashing with 256 threads of 4KB packets over a 10Gb/s link.
>>>>>
>>>>> That's theoretically 256K packets/s, very quickly filling even your
>>>>> "large" max ops setting.
>>>>> Also, the "journal max write entries" will need to be adjusted to
>>>>> suit the abilities (speed- and merge-wise) of your OSDs.
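>>>>> (These can also be injected into a running OSD to avoid restarts
>>>>> while experimenting, e.g.
>>>>> "ceph tell osd.2 injectargs '--journal_max_write_entries 2048'".)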
>>>>>
>>>>> With 40 million max ops and 2048 max write entries I get this
>>>>> (instead of values similar to yours with the defaults):
>>>>>
>>>>>  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>>    1     256      2963      2707   10.5707   10.5742  0.125177 0.0830565
>>>>>    2     256      5278      5022   9.80635   9.04297  0.247757 0.0968146
>>>>>    3     256      7276      7020   9.13867   7.80469  0.002813 0.0994022
>>>>>    4     256      8774      8518   8.31665   5.85156  0.002976  0.107339
>>>>>    5     256     10121      9865   7.70548   5.26172  0.002569  0.117767
>>>>>    6     256     11363     11107   7.22969   4.85156   0.38909  0.130649
>>>>>    7     256     12354     12098    6.7498   3.87109  0.002857  0.137199
>>>>>    8     256     12392     12136   5.92465  0.148438   1.15075  0.138359
>>>>>    9     256     12551     12295   5.33538  0.621094  0.003575  0.151978
>>>>>   10     256     13099     12843    5.0159   2.14062  0.146283   0.17639
>>>>>
>>>>> Of course this tails off eventually, but the effect is obvious and
>>>>> the bandwidth is double that of the default values.
>>>>>
>>>>> I'm sure some inktank person will pipe up momentarily as to why
>>>>> these defaults were chosen and why such huge values are to be
>>>>> avoided. ^.-
>>>>>
>>>> Just from skimming, those numbers do look a little low, but I'm not
>>>> sure how all the latencies work out.
>>>>
>>>> Anyway, the reason we chose the low numbers is to avoid overloading a
>>>> backing hard drive, which is going to have a lot more trouble than the
>>>> journal with a huge backlog of ops. You'll want to test your small-IO
>>>> results for a very long time, and/or with a fairly small journal, to
>>>> check that you don't get a square wave of throughput while waiting for
>>>> the backing disk to commit everything.
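>>>>
>>>> For instance, a much longer run of your earlier 4KB test (same
>>>> parameters, ten minutes instead of 100 seconds) should expose that
>>>> square wave if it's going to appear:
>>>>
>>>> rados -p <pool> bench 600 write -b 4096 -t 256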
>>>>
>>> I assume that's the same reason for the default values of these
>>> parameters?
>>>
>>>   "journal_max_write_bytes": "10485760",
>>>   "journal_queue_max_bytes": "33554432",
>>>
>>> A mere 10 and 32MB.
>>>
>>> According to the documentation, I read this as no more than 10MB per
>>> write to the journal and no more than 32MB in the queue, ever.
>>> Is the queue the entire journal, or a per-client/connection thing?
>>>
>>> If the entire journal, why do people use 10GB or in my case 40GB
>>> journals? ^o^
>>>
>>> Regards,
>>>
>>> Christian
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> ch...@gol.com           Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/
>
