Okay, I found this is a known, long-standing open item:
https://openzfs.org/wiki/Features#Improve_N-way_mirror_read_performance

And the tracking issue https://www.illumos.org/issues/4334 has been silent for 
about 8 years.

Found some original discussions in the illumos-zfs group:

https://illumos.topicbox.com/groups/zfs/T0bf2e401c173399b

https://illumos.topicbox.com/groups/zfs/T15af07bb228f0a62

I have requested to join that group, but my request is still pending approval.

I'm not sure how many of the experts in that group are also reachable through 
the two lists I'm posting to (i.e. [illumos-discuss] and [smartos-discuss]), so 
I'd like to start some discussion right here, right now.

I have no expertise in RAID/storage implementation myself, but I do software 
engineering for a living (formerly in enterprise systems, then HPC in quant 
systems).

From an architectural perspective, I see a simpler yet seemingly more promising 
approach, even better than the [Improve N-way mirror read 
performance](https://openzfs.org/wiki/Features#Improve_N-way_mirror_read_performance) 
feature.

I suggest refactoring the I/O queue hierarchy a bit. As inferred from the 
discussions, AIUI each member disk (vdev) of a mirror currently possesses its 
own I/O queue, and the I/O scheduler actively chooses a vdev for each specific 
I/O request (thus doing the load balancing). My proposal is to have each mirror 
possess a single I/O queue, shared by all its member disks; then every disk can 
work at its full potential, for both reads and writes, as long as the mirror's 
I/O queue is not drained. We wouldn't need to estimate how fast each disk can 
run, nor care whether it is spinning or not.
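To make the intuition concrete, here is a toy simulation (my own sketch, not 
ZFS code; the tick-based service times, the function names, and the plain 
round-robin baseline are all simplifying assumptions) of a 2-way mirror with 
one fast and one slow disk, comparing statically assigned per-vdev queues 
against a single shared mirror queue that idle disks pull from:

```python
# Hypothetical model: 80 read requests against a 2-way mirror whose disks
# take 1 tick (SSD-ish) and 8 ticks (HDD-ish) per request, respectively.

def round_robin_makespan(n_reqs, svc_times):
    """Scheduler pre-assigns requests to per-disk queues in round-robin order."""
    finish = [0] * len(svc_times)
    for i in range(n_reqs):
        d = i % len(svc_times)       # static assignment, blind to disk speed
        finish[d] += svc_times[d]
    return max(finish)               # done when the slowest queue drains

def shared_queue_makespan(n_reqs, svc_times):
    """Disks pull the next request from one shared mirror queue when idle."""
    free_at = [0] * len(svc_times)   # time each disk next becomes idle
    for _ in range(n_reqs):
        d = min(range(len(svc_times)), key=lambda j: free_at[j])
        free_at[d] += svc_times[d]   # the first-idle disk services the request
    return max(free_at)

if __name__ == "__main__":
    reqs, speeds = 80, [1, 8]
    print("round-robin per-vdev queues:", round_robin_makespan(reqs, speeds))  # 320
    print("shared mirror queue:        ", shared_queue_makespan(reqs, speeds))  # 72
```

With these made-up numbers the shared queue finishes all 80 requests in 72 
ticks versus 320 for static round-robin, simply because the fast disk is never 
left idle while work remains; no per-disk speed model is needed. The real 
scheduler is of course far more nuanced than this.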

This approach is at least far more adaptive than the above-mentioned feature, 
and in theory it is also much simpler to implement. That said, refactoring the 
established queue structure w.r.t. the I/O scheduler would certainly carry some 
complexity; battle-testing the feature for stability, for example, would 
require extra time and effort.

What do you think?

Best regards,
Compl


> On 2022-01-24, at 17:58, YueCompl via illumos-discuss 
> <[email protected]> wrote:
> 
> So great to see it in action! Thanks gea for the benchmark!
> 
> Looks like your benchmark is done with OmniOS, so SSD+HDD mirror is really a 
> bad idea if on Illumos ZFS up to your OmniOS' version.
> 
> I can understand the benchmark result except for sole SSD randomrw.f be at 
> 72.8 MB/s (vs 118.2 MB/s), it's rather much slower than sole HD! 
> 
> Why? I don't understand but a wild guess - somehow the cache works in 
> write-through mode on SSD only vdev, while on HD involved vdev, the cache 
> works in write-back mode?
> 
> --
> 
> Then I googled more and found that OpenZFS people have tackled almost 
> identical situation with great progress:
> 
> https://github.com/openzfs/zfs/issues/1461#issuecomment-18399909
> 
> > For example in my tests with 1 x disk and 1 x ram device:
> > 
> > Normal round robin: 80MB/s+80MB/s = 160MB/sec
> > Balancing off pending: 80MB/s + 200MB/s = 280MB/s
> > With slow vdevs offlined: 0MB/s + 400MB/s = 400MB/s
> 
> 
> https://github.com/openzfs/zfs/issues/1461#issuecomment-18783825
> 
> > Just for reference here's the patch I've proposed running against a ssd+hdd 
> > pool. Because it sounds like your most interested in IO/s I've run fio 
> > using 16 threads and a 4k random read workloads.
> > 
> > $ zpool status -v
> >     NAME        STATE     READ WRITE CKSUM
> >     hybrid      ONLINE       0     0     0
> >       mirror-0  ONLINE       0     0     0
> >         ssd-1   ONLINE       0     0     0
> >         hdd-1   ONLINE       0     0     0
> >       mirror-1  ONLINE       0     0     0
> >         ssd-2   ONLINE       0     0     0
> >         hdd-2   ONLINE       0     0     0
> >       mirror-2  ONLINE       0     0     0
> >         ssd-3   ONLINE       0     0     0
> >         hdd-3   ONLINE       0     0     0
> >       mirror-3  ONLINE       0     0     0
> >         ssd-4   ONLINE       0     0     0
> >         hdd-4   ONLINE       0     0     0
> > 
> > $ iostat -mxz 5
> > Device:   rrqm/s   wrqm/s      r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> > hdd-1       0.00     0.00   156.80    0.00    19.60     0.00   256.00     2.52   16.09   6.37  99.88
> > hdd-2       0.00     0.00   149.80    0.00    18.73     0.00   256.00     2.51   16.78   6.66  99.80
> > hdd-3       0.00     0.00   150.40    0.00    18.80     0.00   256.00     2.52   16.74   6.63  99.72
> > hdd-4       0.00     0.00   154.20    0.00    19.28     0.00   256.00     2.46   15.98   6.47  99.70
> > ssd-1       0.00     0.00  1250.80    0.00   156.35     0.00   256.00     1.32    1.06   0.62  77.70
> > ssd-2       0.00     0.00  1237.20    0.00   154.65     0.00   256.00     1.31    1.06   0.62  77.16
> > ssd-3       0.00     0.00  1243.40    0.00   155.42     0.00   256.00     1.33    1.07   0.62  77.68
> > ssd-4       0.00     0.00  1227.60    0.00   153.45     0.00   256.00     1.29    1.05   0.63  77.36
> > 
> > $ fio rnd-read-4k
> > read: (groupid=0, jobs=16): err= 0: pid=13087
> >   read : io=1324.3MB, bw=22594KB/s, iops=5648 , runt= 60018msec
> >     clat (usec): min=6 , max=276720 , avg=2762.90, stdev=1493.94
> >      lat (usec): min=6 , max=276720 , avg=2763.11, stdev=1493.94
> >     bw (KB/s) : min=  109, max= 2743, per=6.40%, avg=1446.40, stdev=67.99
> >   cpu          : usr=0.16%, sys=1.37%, ctx=336494, majf=0, minf=432
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued r/w/d: total=339009/0/0, short=0/0/0
> >      lat (usec): 10=0.53%, 20=2.81%, 50=0.02%, 100=0.01%, 750=0.01%
> >      lat (usec): 1000=33.99%
> >      lat (msec): 2=50.03%, 4=1.93%, 10=2.75%, 20=5.09%, 50=2.69%
> >      lat (msec): 100=0.15%, 250=0.01%, 500=0.01%
> > 
> > Run status group 0 (all jobs):
> >    READ: io=1324.3MB, aggrb=22593KB/s, minb=23136KB/s, maxb=23136KB/s, mint=60018msec, maxt=60018msec
> > 
> > From the above results you can quickly see a few things.
> > 
> >   1. As expected the hard drives are nearly 100% utilized and limited at 
> > roughly 150 IO/s. The ssds are not quite maxed out but they are handling 
> > the vast majority of the IO/s at roughly 1200+ operations per second. Some 
> > additional tuning is probably possible to increase that utilization but 
> > that's not too shabby.
> > 
> >   2. Sustained IO/s for the pool is a respectable 5648, but as expected you 
> > have a bimodal distribution for the latency. Some IO/s are very fast when 
> > serviced by the SSD others are slow when the HDD is used, for this case the 
> > worst case latency was a quite large 276ms with an average of 2.7ms. We 
> > could probably do a bit more to bound the worst case latencies but I think 
> > this is a good start.
> 
> https://github.com/openzfs/zfs/issues/1461#issuecomment-20507619
> 
> > This is better than a cure for cancer. Every time I watch you guys improve 
> > ZFS, it's like magic.
> 
> ---
> 
> That leads to the patch [Improve N-way mirror 
> performance](https://github.com/openzfs/zfs/commit/556011dbec2d10579819078559a77630fc559112), 
> merged into OpenZFS on Jul 12 2013. So it seems a similar solution has not 
> found its way into Illumos ZFS?
> 
> Can we initiate some effort to backport that into Illumos ZFS?
> 
> Best regards,
> Compl
> 
> 
>> On 2022-01-23, at 22:48, Günther Ernst Alka <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> 
>> I have made some benchmarks of the following mixed mirror 
>> (fast Samsung SAS SSD 960, slow HGST 2TB HD, 12G HBA)
>> 
>> <p6HxZ77PXXgwzzSb.png>
>> 
>> a quick dd Benchmark of the hd alone (basic pool) gives
>> 
>> 12.8 GB in 94.3s = 143.82 MB/s Write
>> 12.8 GB in 104.3s = 130.48 MB/s Read
>> 
>> a dd Benchmark of the ssd alone (basic pool) gives
>> 
>> 12.8 GB in 23.3s = 549.36 MB/s Write
>> 12.8 GB in 16.8s = 870.75 MB/s Read
>> 
>> a dd Benchmark of the mirror gives
>> 
>> 12.8 GB in 94.4s = 135.59 MB/s Write
>> 12.8 GB in 68.6s = 186.59 MB/s Read
>> 
>> a multistream filebench benchmark of the mirror gives (filebench 
>> 5streamread/write to check effects of parallel io)
>> 
>> <cMT9iCkTNiNUk3LW.png>
>> 
>> 
>> Result from dd (singlstream):
>> Write mirror vs disks: Writing to the mixed pool is as slow as to the hd
>> Read mirror vs disks: Reading from the mixed pool is only slightly faster as 
>> from the hd
>> 
>> 
>> singlestream filebench results
>> 
>>                                 randomread.f     randomrw.f     singlestreamr
>> pri/sec cache=all  (hd)         103.4 MB/s       118.2 MB/s     1.0 GB/s
>> 
>>                                 randomread.f     randomrw.f     singlestreamr
>> pri/sec cache=all (ssd)         120.6 MB/s       72.8 MB/s      1.1 GB/s  
>> 
>>                                 randomread.f     randomrw.f     singlestreamr
>> pri/sec cache=all (mirror)      92.8 MB/s        117.0 MB/s     990.4 MB/s   
>>                  
>> ________________________________________________________________________________________
>> 
>> 
>> Result from filebench  (multistream):
>>    Write mirror vs disks: Writing to the mixed pool is as slow as to the hd
>>    Read mirror vs disks: Single/Random Reading from the mixed pool is as 
>> slow as from the hd
>> 
>> Filebench result
>>    Single/Random Reading from the mixed pool is as slow as from the hd.
>>    random r/w of the mixed pool is better than hd or ssd alone
>>    Multistream Reading from the mixed pool is twice as fast as singlestream 
>> read from the ssd or hd
>>    and nearly twice as good as singlstreamread from the mixed pool
>>    (read values are too good, readcache seems to improve values massively 
>> but shows indeed the effect of parallel multistream reading)
>> 
>> In the end especially the bad write and weak read results indicate that a 
>> mixed mirror is a bad idea
>> 
>> gea
>> 
>> 
>> 
> 
> Permalink: 
> https://illumos.topicbox.com/groups/discuss/Tc34977e2e913194b-M52baae03e27a1259fcd51dcf

------------------------------------------
illumos: illumos-discuss
Permalink: 
https://illumos.topicbox.com/groups/discuss/T5df6834e7fbf78a8-Mcc055aaf74d673bada3486ea
Delivery options: https://illumos.topicbox.com/groups/discuss/subscription
