> On 28 Feb 2024, at 23:03, Matthew Grooms <[email protected]> wrote:
>
> ...
> The virtual disks were provisioned with either a 128G disk image or a 1TB raw
> partition, so I don't think space was an issue.
> Trim is definitely not an issue. I'm using a tiny fraction of the 32TB array
> and have tried both heavily under-provisioned HW RAID10 and SW RAID10 using GEOM.
> The latter was tested after sending full trim resets to all drives
> individually.
>
It could be that if TRIM/UNMAP is not used, the zvol (for instance) eventually
fills up: ZFS then considers all of its blocks to be in use, and write
operations can suffer. I believe that was fixed recently.
Also look at this path:
GuestFS->UNMAP->bhyve->Host-FS->PhysicalDisk
The problem with UNMAP is that it can cause unpredictable slowdowns at any
time, so I would suggest comparing results with UNMAP enabled and disabled in
the guest.
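For example, to see whether discard is active in the guest and to run one pass
without it (just a sketch; assuming the guest disk shows up as /dev/vda with an
ext4 filesystem on /dev/vda1, adjust device names for nvme/virtio):
~# cat /sys/block/vda/queue/discard_max_bytes   # 0 means the device does not advertise discard/UNMAP
~# mount -o nodiscard /dev/vda1 /mnt            # mount without online discard for the test
~# fstrim -v /mnt                               # or issue a one-shot trim manually instead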
> I will try to incorporate the rest of your feedback into my next round of
> testing. If I can find a benchmark tool that works with a raw block device,
> that would be ideal.
>
>
Use "dd" as a first step for read testing:
~# dd if=/dev/nvme0n2 of=/dev/null bs=1M status=progress iflag=direct
~# dd if=/dev/nvme0n2 of=/dev/null bs=1M status=progress
Compare the results with and without direct I/O.
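To keep the buffered run comparable, it may also help to drop the guest page
cache first (the same knob mentioned earlier in this thread):
~# sync; echo 3 > /proc/sys/vm/drop_caches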
For a more thorough test, use the "fio" tool.
1) write prepare:
~# fio --name=prep --rw=write --verify=crc32 --loops=1 --numjobs=2 \
      --time_based --thread --bs=1M --iodepth=32 --ioengine=libaio --direct=1 \
      --group_reporting --size=20G --filename=/dev/nvme0n2
2) read test:
~# fio --name=readtest --rw=read --loops=30 --numjobs=2 --time_based \
      --thread --bs=256K --iodepth=32 --ioengine=libaio --direct=1 \
      --group_reporting --size=20G --filename=/dev/nvme0n2
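If useful, the same pattern extends to a random-read pass (again only a sketch,
adjust size and filename to your setup) to see whether the slowdown also
affects non-sequential IO:
~# fio --name=randread --rw=randread --loops=30 --numjobs=2 --time_based \
      --thread --bs=4K --iodepth=32 --ioengine=libaio --direct=1 \
      --group_reporting --size=20G --filename=/dev/nvme0n2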
--
Vitaliy
> Thanks,
>
> -Matthew
>
>
>
>> --
>> Vitaliy
>>
>>> On 28 Feb 2024, at 21:29, Matthew Grooms <[email protected]>
>>> <mailto:[email protected]> wrote:
>>>
>>> On 2/27/24 04:21, Vitaliy Gusev wrote:
>>>> Hi,
>>>>
>>>>
>>>>> On 23 Feb 2024, at 18:37, Matthew Grooms <[email protected]>
>>>>> <mailto:[email protected]> wrote:
>>>>>
>>>>>> ...
>>>>> The problem occurs when an image file is used on either ZFS or UFS. The
>>>>> problem also occurs when the virtual disk is backed by a raw disk
>>>>> partition or a ZVOL. This issue isn't related to a specific underlying
>>>>> filesystem.
>>>>>
>>>>
>>>> Do I understand correctly that you ran the tests inside the guest VM on an
>>>> ext4 filesystem? If so, you should be aware of the additional overhead
>>>> compared with running the same tests directly on the host.
>>>>
>>> Hi Vitaliy,
>>>
>>> I appreciate you providing the feedback and suggestions. I spent over a
>>> week trying as many combinations of host and guest options as possible to
>>> narrow this issue down to a specific host storage or a guest device model
>>> option. Unfortunately the problem occurred with every combination I tested
>>> while running Linux as the guest. Note, I only tested RHEL8 & RHEL9
>>> compatible distributions ( Alma & Rocky ). The problem did not occur when I
>>> ran FreeBSD as the guest. The problem did not occur when I ran KVM in the
>>> host and Linux as the guest.
>>>
>>>> I would suggest running fio (or even dd) against the raw disk device inside
>>>> the VM, i.e. without any filesystem at all. Just do not forget to run "echo 3 >
>>>> /proc/sys/vm/drop_caches" in the Linux guest before each test.
>>> The two servers I was using to test with are no longer available.
>>> However, I'll have two more identical servers arriving in the next week or
>>> so. I'll try to run additional tests and report back here. I used bonnie++
>>> as that was easily installed from the package repos on all the systems I
>>> tested.
>>>
>>>>
>>>> Could you also give more information about:
>>>>
>>>> 1. What results did you get (decode bonnie++ output)?
>>> If you look back at this email thread, there are many examples of running
>>> bonnie++ on the guest. I first ran the tests on the host system using Linux
>>> + ext4 and FreeBSD 14 + UFS & ZFS to get a baseline of performance. Then I
>>> ran bonnie++ tests using bhyve as the hypervisor and Linux & FreeBSD as the
>>> guest. The combination of host and guest storage options included ...
>>>
>>> 1) block device + virtio blk
>>> 2) block device + nvme
>>> 3) UFS disk image + virtio blk
>>> 4) UFS disk image + nvme
>>> 5) ZFS disk image + virtio blk
>>> 6) ZFS disk image + nvme
>>> 7) ZVOL + virtio blk
>>> 8) ZVOL + nvme
>>>
>>> In every instance, I observed the Linux guest disk IO often perform very
>>> well for some time after the guest was first booted. Then the performance
>>> of the guest would drop to a fraction of the original performance. The
>>> benchmark test was run every 5 or 10 minutes in a cron job. Sometimes the
>>> guest would perform well for up to an hour before performance would drop
>>> off. Most of the time it would only perform well for a few cycles ( 10 - 30
>>> mins ) before performance would drop off. The only way to restore the
>>> performance was to reboot the guest. Once I determined that the problem was
>>> not specific to a particular host or guest storage option, I switched my
>>> testing to only use a block device as backing storage on the host to avoid
>>> hitting any system disk caches.
>>>
>>> Here is the test script I used in the cron job ...
>>>
>>> #!/bin/sh
>>> FNAME='output.txt'
>>>
>>> echo ================================================================================ >> $FNAME
>>> echo Begin @ `/usr/bin/date` >> $FNAME
>>> echo >> $FNAME
>>> /usr/sbin/bonnie++ 2>&1 | /usr/bin/grep -v 'done\|,' >> $FNAME
>>> echo >> $FNAME
>>> echo End @ `/usr/bin/date` >> $FNAME
>>>
>>> As you can see, I'm calling bonnie++ with the system defaults. That uses a
>>> data set size that's 2x the guest RAM in an attempt to minimize the effect
>>> of filesystem cache on results. Here is an example of the output that
>>> bonnie++ produces ...
>>>
>>> Version 2.00       ------Sequential Output------ --Sequential Input- --Random-
>>>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>> Name:Size etc       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>> linux-blk   63640M  694k  99  1.6g  99  737m  76  985k  99  1.3g  69 +++++ +++
>>> Latency            11579us     535us   11889us    8597us   21819us    8238us
>>> Version 2.00       ------Sequential Create------ --------Random Create--------
>>> linux-blk          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
>>> Latency             7620us     126us    1648us     151us      15us     633us
>>>
>>> --------------------------------- speed drop ---------------------------------
>>>
>>> Version 2.00       ------Sequential Output------ --Sequential Input- --Random-
>>>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>> Name:Size etc       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>> linux-blk   63640M  676k  99  451m  99  314m  93  951k  99  402m  99 15167 530
>>> Latency            11902us    8959us   24711us   10185us   20884us    5831us
>>> Version 2.00       ------Sequential Create------ --------Random Create--------
>>> linux-blk          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>                  16     0  96 +++++ +++ +++++ +++     0  96 +++++ +++     0  75
>>> Latency               343us     165us    1636us     113us      55us    1836us
>>>
>>> In the example above, the benchmark test repeated about 20 times with
>>> results that were similar to the performance shown above the dotted line (
>>> ~ 1.6g/s seq write and 1.3g/s seq read ). After that, the performance
>>> dropped to what's shown below the dotted line which is less than 1/4 the
>>> original speed ( ~ 451m/s seq write and 402m/s seq read ).
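>>>
>>> For reference, the default run above should be roughly equivalent to passing
>>> the sizes explicitly, something like the following (illustrative values taken
>>> from the 63640M data set shown above, and a placeholder test directory, not
>>> the exact command I ran):
>>>
>>> /usr/sbin/bonnie++ -d /testdir -r 31820 -s 63640 -n 16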
>>>
>>>> 2. What results expecting?
>>>>
>>> What I expect is that, when I perform the same test with the same
>>> parameters, the results would stay more or less consistent over time. This
>>> is true when KVM is used as the hypervisor on the same hardware and guest
>>> options. That said, I'm not worried about bhyve being consistently slower
>>> than kvm or a FreeBSD guest being consistently slower than a Linux guest.
>>> I'm concerned that the performance drop over time is indicative of an issue
>>> with how bhyve interacts with non-FreeBSD guests.
>>>
>>>> 3. VM configuration, virtio-blk disk size, etc.
>>>> 4. Full command for tests (including size of test-set), bhyve, etc.
>>> I believe this was answered above. Please let me know if you have
>>> additional questions.
>>>
>>>>
>>>> 5. Did you pass virtio-blk as 512 or 4K? If 512, you should probably try 4K.
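>>>> For example, via the sectorsize block-device option documented in bhyve(8),
>>>> something like this (the slot number and backing device are placeholders):
>>>>
>>>> -s 4,virtio-blk,/dev/da0,sectorsize=4096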
>>>>
>>> The testing performed was not exclusively with virtio-blk.
>>>
>>>
>>>> 6. Linux has several read-ahead and IO scheduler options, and they could be
>>>> related too.
>>>>
>>> I suppose it's possible that bhyve could be somehow causing the disk
>>> scheduler in the Linux guest to act differently. I'll see if I can figure
>>> out how to disable that in future tests.
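>>>
>>> If it comes down to the usual sysfs knobs, something like this should show
>>> and adjust both the scheduler and read-ahead in the guest (a sketch only; it
>>> assumes the disk appears as /dev/vda, adjust for nvme/sd devices):
>>>
>>> cat /sys/block/vda/queue/scheduler          # current IO scheduler, e.g. [mq-deadline] kyber bfq none
>>> echo none > /sys/block/vda/queue/scheduler  # bypass the guest IO scheduler
>>> cat /sys/block/vda/queue/read_ahead_kb      # current read-ahead in KiB
>>> echo 0 > /sys/block/vda/queue/read_ahead_kb # disable read-ahead for the test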
>>>
>>>
>>>> Additionally, could you also play with the "sync=disabled" volume/zvol option?
>>>> Of course it only matters for write testing.
>>> The testing performed was not exclusively with zvols.
>>>
>>>
>>> Once I have more hardware available, I'll try to report back with more
>>> testing. It may be interesting to also see how a Windows guest performs
>>> compared to Linux & FreeBSD. I suspect that this issue may only be
>>> triggered when a fast disk array is in use on the host. My tests use a 16x
>>> SSD RAID 10 array. It's also quite possible that the disk IO slowdown is
>>> only a symptom of another issue that's triggered by the disk IO test (
>>> please see end of my last post related to scheduler priority observations
>>> ). All I can say for sure is that ...
>>>
>>> 1) There is a problem and it's reproducible across multiple hosts
>>> 2) It affects RHEL8 & RHEL9 guests but not FreeBSD guests
>>> 3) It is not specific to any host or guest storage option
>>>
>>> Thanks,
>>>
>>> -Matthew
>>>