On 6/8/22 16:15, Jakub Wartak wrote:
> Hi, got some answers!
>
> TL;DR: for fio it would make sense to use many stress files (instead of 1) and
> likewise numjobs ~ number of vCPUs, to avoid various pitfalls.
>
>>>> The really
>>>> puzzling thing is why is the filesystem so much slower for smaller
>>>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
>>>> Why would a filesystem have such effect?
>>>
>>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
>>> many real world scenarios ;)
> [..]
>> Independently of that, it seems like an interesting behavior and it might
>> tell us
>> something about how to optimize for larger pages.
>
> OK, curiosity won:
>
> With randwrite on ext4 directio using 4kB the avgqu-sz reaches ~90-100 (close
> to fio's 128 queue depth?) and I'm getting ~70k IOPS [with maxdepth=128].
> With randwrite on ext4 directio using 1kB the avgqu-sz is just 0.7 and I'm
> getting just ~17-22k IOPS [with maxdepth=128] -> conclusion: something is
> being locked, preventing the queue from building up.
> With randwrite on ext4 directio using 4kB the avgqu-sz reaches ~2.3 (so
> something is queued) and I'm also getting ~70k IOPS with the minimal possible
> maxdepth=4 -> conclusion: I just need to split the lock contention by 4.
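
If I'm reading those avgqu-sz numbers right and applying Little's law
(avgqu-sz ~ IOPS x average latency), this looks like serialized submission
rather than a slow device: 0.7 / ~20000 IOPS is roughly 35 us per 1kB write,
i.e. each I/O completes quickly but there's essentially never more than one in
flight, while ~95 / ~70000 IOPS for 4kB is ~1.4 ms with lots of parallelism.
So the device seems to have headroom; the queue just never gets a chance to
build up. (Just my arithmetic on your numbers, so take it with a grain of
salt.)
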
>
> The 1kB (slow) profile's top function is aio_write() -> .... ->
> iov_iter_get_pages() -> internal_get_user_pages_fast(), and there are sadly
> plenty of "lock" keywords inside {related to the memory manager, padding to
> full page size, inode locking}. One can also find some articles / commits
> related to it [1], which didn't make a good impression to be honest, as fio
> is using just 1 file (even though I'm on kernel 5.10.x). So I've switched to
> 4 files and numjobs=4 and easily got 60k IOPS; contention solved, whatever it
> was :) So I would assume PostgreSQL (with its splitting of data files on 1GB
> boundaries by default and its multi-process architecture) should be
> relatively safe from such ext4 inode(?)/mm(?) contentions even with the
> smallest 1kB block sizes, should it get direct I/O some day.
>
Interesting. So what parameter values would you suggest?
FWIW some of the tests I did were on xfs, so I wonder if that might be
hitting similar/other bottlenecks.
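
Just to make sure I understand the setup you ended up with, is it roughly
something like this? (untested sketch on my side; the directory and size
values are placeholders, and iodepth is fio's name for the queue depth)

  [global]
  ioengine=libaio
  direct=1
  rw=randwrite
  bs=1k
  iodepth=32
  runtime=60
  time_based=1
  group_reporting=1
  directory=/mnt/ext4-test   ; placeholder mount point
  size=1g                    ; per-file size, placeholder

  ; numjobs ~ number of vCPUs, each clone writing its own file
  [randwrite-split]
  numjobs=4

AFAIK when only 'directory' is set, each of the numjobs clones writes to its
own file, which I assume is what spread whatever inode/mm lock you were
hitting.
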
> [1] - https://www.phoronix.com/scan.php?page=news_item&px=EXT4-DIO-Faster-DBs
>
>>> Both scenarios (raw and fs) have had direct=1 set. I just cannot understand
>>> how having direct I/O enabled (which disables caching) achieves better read
>>> IOPS on ext4 than on the raw device... isn't that a contradiction?
>>>
>>
>> Thanks for the clarification. Not sure what might be causing this. Did you
>> use the same parameters (e.g. iodepth) in both cases?
>
> Explanation: it's the CPU scheduler migrations mixing up the performance
> results during the fio runs (as you have in your framework). Various vCPUs
> seem to have varying max IOPS characteristics (sic!) and the CPU scheduler
> seems to be unaware of it. This happens at least with 1kB and 4kB block
> sizes; also notice that some vCPUs [XXXX marker] don't reach 100% CPU yet
> achieve almost twice the result, while cores 0 and 3 do reach 100% and lack
> the CPU power to perform more. The only thing I don't get is that it doesn't
> make sense from the extended lscpu output (but maybe it's AWS Xen mixing up
> the real CPU mappings, who knows).
Uh, that's strange. I haven't seen anything like that, but I'm running
on physical HW and not AWS, so it's either that or maybe I just didn't
do the same test.
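
If it really is the migrations, maybe pinning the fio jobs would at least make
the runs repeatable? Something like this, perhaps (just a sketch, assuming
fio's cpus_allowed / cpus_allowed_policy options; the CPU list and paths are
made up):

  fio --name=pinned --ioengine=libaio --direct=1 --rw=randwrite --bs=1k \
      --iodepth=32 --numjobs=4 --cpus_allowed=0-3 --cpus_allowed_policy=split \
      --directory=/mnt/ext4-test --size=1g --runtime=60 --time_based \
      --group_reporting

With cpus_allowed_policy=split each job should get its own CPU from that set,
so it should at least become obvious whether the "fast" and "slow" vCPUs are
real or just an artifact of the migrations.
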
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company