> On 28 Feb 2024, at 23:03, Matthew Grooms <[email protected]> wrote:
>
> ...
> The virtual disks were provisioned with either a 128G disk image or a 1TB raw
> partition, so I don't think space was an issue.
> Trim is definitely not an issue. I'm using a tiny fraction of the 32TB array
> and have tried both heavily under-provisioned HW RAID10 and SW RAID10 using GEOM.
> The latter was tested after sending full trim resets to all drives
> individually.
>
It could be that if TRIM/UNMAP is not used, the zvol (for instance) eventually
fills up: ZFS then considers all of its blocks to be in use, and write
operations can suffer. I believe that was fixed recently.
Also look at this path:
GuestFS->UNMAP->bhyve->Host-FS->PhysicalDisk
The problem with UNMAP is that it can cause unpredictable slowdowns at any
time, so I would suggest comparing results with UNMAP enabled and disabled in
the guest.
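For example, to see whether discard is active in the guest and to run one pass
without it (just a sketch; assuming the guest disk shows up as /dev/vda with an
ext4 filesystem on /dev/vda1, adjust device names for nvme/virtio):
~# cat /sys/block/vda/queue/discard_max_bytes   # 0 means the device does not advertise discard/UNMAP
~# mount -o nodiscard /dev/vda1 /mnt            # mount without online discard for the test
~# fstrim -v /mnt                               # or issue a one-shot trim manually instead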
> I will try to incorporate the rest of your feedback into my next round of
> testing. If I can find a benchmark tool that works with a raw block device,
> that would be ideal.
>
>
Use "dd" as a first step for read testing:
~# dd if=/dev/nvme0n2 of=/dev/null bs=1M status=progress iflag=direct
~# dd if=/dev/nvme0n2 of=/dev/null bs=1M status=progress
Compare the results with and without direct I/O.
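To keep the buffered run comparable, it may also help to drop the guest page
cache first (the same knob mentioned earlier in this thread):
~# sync; echo 3 > /proc/sys/vm/drop_caches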
For a more thorough test, use the "fio" tool.
1) write prepare:
~# fio --name=prep --rw=write --verify=crc32 --loops=1 --numjobs=2 \
      --time_based --thread --bs=1M --iodepth=32 --ioengine=libaio --direct=1 \
      --group_reporting --size=20G --filename=/dev/nvme0n2
2) read test:
~# fio --name=readtest --rw=read --loops=30 --numjobs=2 --time_based \
      --thread --bs=256K --iodepth=32 --ioengine=libaio --direct=1 \
      --group_reporting --size=20G --filename=/dev/nvme0n2
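If useful, the same pattern extends to a random-read pass (again only a sketch,
adjust size and filename to your setup) to see whether the slowdown also
affects non-sequential IO:
~# fio --name=randread --rw=randread --loops=30 --numjobs=2 --time_based \
      --thread --bs=4K --iodepth=32 --ioengine=libaio --direct=1 \
      --group_reporting --size=20G --filename=/dev/nvme0n2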
--
Vitaliy
> Thanks,
>
> -Matthew
>
>
>
>> --
>> Vitaliy
>>
>>> On 28 Feb 2024, at 21:29, Matthew Grooms <[email protected]>
>>> <mailto:[email protected]> wrote:
>>>
>>> On 2/27/24 04:21, Vitaliy Gusev wrote:
>>>> Hi,
>>>>
>>>>
>>>>> On 23 Feb 2024, at 18:37, Matthew Grooms <[email protected]>
>>>>> <mailto:[email protected]> wrote:
>>>>>
>>>>>> ...
>>>>> The problem occurs when an image file is used on either ZFS or UFS. The
>>>>> problem also occurs when the virtual disk is backed by a raw disk
>>>>> partition or a ZVOL. This issue isn't related to a specific underlying
>>>>> filesystem.
>>>>>
>>>>
>>>> Do I understand correctly that you ran the tests inside the guest VM on an
>>>> ext4 filesystem? If so, you should be aware of the additional overhead
>>>> compared with running the same tests directly on the host.
>>>>
>>> Hi Vitaliy,
>>>
>>> I appreciate you providing the feedback and suggestions. I spent over a
>>> week trying as many combinations of host and guest options as possible to
>>> narrow this issue down to a specific host storage or a guest device model
>>> option. Unfortunately the problem occurred with every combination I tested
>>> while running Linux as the guest. Note, I only tested RHEL8 & RHEL9
>>> compatible distributions ( Alma & Rocky ). The problem did not occur when I
>>> ran FreeBSD as the guest. The problem did not occur when I ran KVM in the
>>> host and Linux as the guest.
>>>
>>>> I would suggest running fio (or even dd) against the raw disk device inside
>>>> the VM, i.e. without any filesystem at all. Just do not forget to run "echo 3 >
>>>> /proc/sys/vm/drop_caches" in the Linux guest before each test.
>>> The two servers I was using to test with are no longer available.
>>> However, I'll have two more identical servers arriving in the next week or
>>> so. I'll try to run additional tests and report back here. I used bonnie++
>>> as that was easily installed from the package repos on all the systems I
>>> tested.
>>>
>>>>
>>>> Could you also give more information about:
>>>>
>>>> 1. What results did you get (decode bonnie++ output)?
>>> If you look back at this email thread, there are many examples of running
>>> bonnie++ on the guest. I first ran the tests on the host system using Linux
>>> + ext4 and FreeBSD 14 + UFS & ZFS to get a baseline of performance. Then I
>>> ran bonnie++ tests using bhyve as the hypervisor and Linux & FreeBSD as the
>>> guest. The combination of host and guest storage options included ...
>>>
>>> 1) block device + virtio blk
>>> 2) block device + nvme
>>> 3) UFS disk image + virtio blk
>>> 4) UFS disk image + nvme
>>> 5) ZFS disk image + virtio blk
>>> 6) ZFS disk image + nvme
>>> 7) ZVOL + virtio blk
>>> 8) ZVOL + nvme
>>>
>>> In every instance, I observed the Linux guest disk IO often perform very
>>> well for some time after the guest was first booted. Then the performance
>>> of the guest would drop to a fraction of the original performance. The
>>> benchmark test was run every 5 or 10 minutes in a cron job. Sometimes the
>>> guest would perform well for up to an hour before performance would drop
>>> off. Most of the time it would only perform well for a few cycles ( 10 - 30
>>> mins ) before performance would drop off. The only way to restore the
>>> performance was to reboot the guest. Once I determined that the problem was
>>> not specific to a particular host or guest storage option, I switched my
>>> testing to only use a block device as backing storage on the host to avoid
>>> hitting any system disk caches.
>>>
>>> Here is the test script I used in the cron job ...
>>>
>>> #!/bin/sh
>>> FNAME='output.txt'
>>>
>>> echo ================================================================================ >> $FNAME
>>> echo Begin @ `/usr/bin/date` >> $FNAME
>>> echo >> $FNAME
>>> /usr/sbin/bonnie++ 2>&1 | /usr/bin/grep -v 'done\|,' >> $FNAME
>>> echo >> $FNAME
>>> echo End @ `/usr/bin/date` >> $FNAME
>>>
>>> As you can see, I'm calling bonnie++ with the system defaults. That uses a
>>> data set size that's 2x the guest RAM in an attempt to minimize the effect
>>> of filesystem cache on results. Here is an example of the output that
>>> bonnie++ produces ...
>>>
>>> Version 2.00       ------Sequential Output------ --Sequential Input- --Random-
>>>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>> Name:Size etc       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>> linux-blk   63640M  694k  99  1.6g  99  737m  76  985k  99  1.3g  69 +++++ +++
>>> Latency            11579us     535us   11889us    8597us   21819us    8238us
>>> Version 2.00       ------Sequential Create------ --------Random Create--------
>>> linux-blk          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
>>> Latency             7620us     126us    1648us     151us      15us     633us
>>>
>>> --------------------------------- speed drop ---------------------------------
>>>
>>> Version 2.00       ------Sequential Output------ --Sequential Input- --Random-
>>>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>> Name:Size etc       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>> linux-blk   63640M  676k  99  451m  99  314m  93  951k  99  402m  99 15167 530
>>> Latency            11902us    8959us   24711us   10185us   20884us    5831us
>>> Version 2.00       ------Sequential Create------ --------Random Create--------
>>> linux-blk          -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>                  16     0  96 +++++ +++ +++++ +++     0  96 +++++ +++     0  75
>>> Latency               343us     165us    1636us     113us      55us    1836us
>>>
>>> In the example above, the benchmark test repeated about 20 times with
>>> results that were similar to the performance shown above the dotted line (
>>> ~ 1.6g/s seq write and 1.3g/s seq read ). After that, the performance
>>> dropped to what's shown below the dotted line which is less than 1/4 the
>>> original speed ( ~ 451m/s seq write and 402m/s seq read ).
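>>>
>>> For reference, the default run above should be roughly equivalent to passing
>>> the sizes explicitly, something like the following (illustrative values taken
>>> from the 63640M data set shown above, and a placeholder test directory, not
>>> the exact command I ran):
>>>
>>> /usr/sbin/bonnie++ -d /testdir -r 31820 -s 63640 -n 16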
>>>
>>>> 2. What results expecting?
>>>>
>>> What I expect is that, when I perform the same test with the same
>>> parameters, the results would stay more or less consistent over time. This
>>> is true when KVM is used as the hypervisor on the same hardware and guest
>>> options. That said, I'm not worried about bhyve being consistently slower
>>> than kvm or a FreeBSD guest being consistently slower than a Linux guest.
>>> I'm concerned that the performance drop over time is indicative of an issue
>>> with how bhyve interacts with non-FreeBSD guests.
>>>
>>>> 3. VM configuration, virtio-blk disk size, etc.
>>>> 4. Full command for tests (including size of test-set), bhyve, etc.
>>> I believe this was answered above. Please let me know if you have
>>> additional questions.
>>>
>>>>
>>>> 5. Did you pass virtio-blk as 512 or 4K? If 512, you should probably try 4K.
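>>>> For example, via the sectorsize block-device option documented in bhyve(8),
>>>> something like this (the slot number and backing device are placeholders):
>>>>
>>>> -s 4,virtio-blk,/dev/da0,sectorsize=4096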
>>>>
>>> The testing performed was not exclusively with virtio-blk.
>>>
>>>
>>>> 6. Linux has several read-ahead and IO scheduler options, and they could be
>>>> related too.
>>>>
>>> I suppose it's possible that bhyve could be somehow causing the disk
>>> scheduler in the Linux guest to act differently. I'll see if I can figure
>>> out how to disable that in future tests.
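>>>
>>> If it comes down to the usual sysfs knobs, something like this should show
>>> and adjust both the scheduler and read-ahead in the guest (a sketch only; it
>>> assumes the disk appears as /dev/vda, adjust for nvme/sd devices):
>>>
>>> cat /sys/block/vda/queue/scheduler          # current IO scheduler, e.g. [mq-deadline] kyber bfq none
>>> echo none > /sys/block/vda/queue/scheduler  # bypass the guest IO scheduler
>>> cat /sys/block/vda/queue/read_ahead_kb      # current read-ahead in KiB
>>> echo 0 > /sys/block/vda/queue/read_ahead_kb # disable read-ahead for the test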
>>>
>>>
>>>> Additionally, could you also play with the "sync=disabled" volume/zvol option?
>>>> Of course it only matters for write testing.
>>> The testing performed was not exclusively with zvols.
>>>
>>>
>>> Once I have more hardware available, I'll try to report back with more
>>> testing. It may be interesting to also see how a Windows guest performs
>>> compared to Linux & FreeBSD. I suspect that this issue may only be
>>> triggered when a fast disk array is in use on the host. My tests use a 16x
>>> SSD RAID 10 array. It's also quite possible that the disk IO slowdown is
>>> only a symptom of another issue that's triggered by the disk IO test (
>>> please see end of my last post related to scheduler priority observations
>>> ). All I can say for sure is that ...
>>>
>>> 1) There is a problem and it's reproducible across multiple hosts
>>> 2) It affects RHEL8 & RHEL9 guests but not FreeBSD guests
>>> 3) It is not specific to any host or guest storage option
>>>
>>> Thanks,
>>>
>>> -Matthew
>>>