On 05/23/2013 03:34 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 09:22:41)
>> On 05/23/2013 03:11 PM, Chris Mason wrote:
>>> Quoting Bernd Schubert (2013-05-23 08:55:47)
>>>> Hi all,
>>>>
>>>> we got a new test system here and I just also tested btrfs raid6 on
>>>> that. Write performance is slightly lower than hw-raid (LSI megasas)
>>>> and md-raid6, but it probably would be much better than either of
>>>> those two if it wouldn't read all the time during the writes. Is this
>>>> a known issue? This is with linux-3.9.2.
>>>
>>> Hi Bernd,
>>>
>>> Any time you do a write smaller than a full stripe, we'll have to do a
>>> read/modify/write cycle to satisfy it. This is true of md raid6 and the
>>> hw-raid as well, but their reads don't show up in vmstat (try iostat
>>> instead).
>>
>> Yeah, I know, and I'm using iostat already. md raid6 does not do rmw,
>> but it does not fill the device queue; afaik it flushes the underlying
>> devices quickly as it does not have barrier support - that is another
>> topic, but it was the reason why I started to test btrfs.
>
> md should support barriers with recent kernels. You might want to
> verify with blktrace that md raid6 isn't doing r/m/w.
>
>>> So the bigger question is where are your small writes coming from. If
>>> they are metadata, you can use raid1 for the metadata.
>>
>> I used this command
>>
>> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]
>
> Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
> times the number of devices on the FS. If you have 13 devices, that's
> 832K.
Actually I have 12 devices, but we have to subtract the 2 parity disks,
which leaves 10 data disks. In the meantime I also patched btrfsprogs to
use a chunksize of 256K, so the full stripe should be 2560 KiB now, if I
found the right places. Btw, any chance to generally use
chunksize/chunklen instead of stripe, as the md layer does? IMHO it is
less confusing to say n-datadisks * chunksize = stripesize.

> Using buffered writes makes it much more likely the VM will break up the
> IOs as they go down. The btrfs writepages code does try to do full
> stripe IO, and it also caches stripes as the IO goes down. But for
> buffered IO it is surprisingly hard to get a 100% hit rate on full
> stripe IO at larger stripe sizes.

I have not found that part yet; somehow it looks as if writepages would
submit single pages to another layer. I'm going to look into it again
over the weekend. I can reserve the hardware that long, but I think we
first need to fix striped writes in general.

>> so meta-data should be raid10. And I'm using this iozone command:
>>
>>> iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
>>>   -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
>>>   /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3
>>
>> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through
>> the page cache anyway. I'm not familiar with the btrfs code at all, but
>> maybe writepages() submits too small IOs?
>>
>> Hrmm, just wanted to try direct IO, but then noticed it had already
>> gone into RO mode before:
>
> Direct IO will make it easier to get full stripe writes. I thought I
> had fixed this abort, but it is just running out of space to write the
> inode cache. For now, please just don't mount with the inode cache
> enabled, I'll send in a fix for the next rc.

Thanks, I already noticed that and disabled the inode cache. Direct IO
works as expected and without any RMW cycles, and it gives more than 40%
better performance than the megasas controller or buffered MD writes
(I didn't compare with direct IO on MD, as that is very slow).

Cheers,
Bernd
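
For reference, a minimal sketch (not from the original thread) of the kind
of direct-IO write pattern discussed above: O_DIRECT writes issued in
full-stripe multiples so the raid6 code never has to read-modify-write.
The 256 KiB chunk size, the 10 data disks and the target path are
assumptions taken from the discussion, not verified values.

/*
 * Sketch only: full-stripe-sized O_DIRECT writes.
 * Chunk size, data-disk count and file path are assumptions.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE   (256 * 1024)               /* assumed per-disk chunk */
#define DATA_DISKS   10                         /* 12 devices - 2 parity  */
#define FULL_STRIPE  (CHUNK_SIZE * DATA_DISKS)  /* 2560 KiB               */

int main(void)
{
        void *buf;
        int fd, i;

        /* O_DIRECT needs a suitably aligned buffer; 4 KiB is safe here. */
        if (posix_memalign(&buf, 4096, FULL_STRIPE))
                return 1;
        memset(buf, 0xab, FULL_STRIPE);

        fd = open("/data/fhgfs/storage/md126/testfile1",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0)
                return 1;

        /* Each write covers exactly one full stripe, so no RMW is needed. */
        for (i = 0; i < 1024; i++)
                if (write(fd, buf, FULL_STRIPE) != FULL_STRIPE)
                        break;

        close(fd);
        free(buf);
        return 0;
}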