speed up big btrfs volumes with ssds
Hello,

I'm trying to speed up big btrfs volumes.

Some facts:
- Kernel will be 4.13-rc7
- needed volume size is 60TB

Currently without any ssds I get the best speed with:
- 4x HW RAID 5 with 1GB controller memory of 4TB 3,5" devices

and using btrfs as raid 0 for data and metadata on top of those 4 raid 5.

I can live with a data loss every now and then ;-) so a raid 0 on top of
the 4x raid5 is acceptable for me.

Currently the write speed is not as good as I would like - especially
for random 8k-16k I/O.

My current idea is to use a pcie flash card with bcache on top of each
raid 5. Is this something which makes sense to speed up the write speed?

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: speed up big btrfs volumes with ssds
> [ ... ] - needed volume size is 60TB

I wonder how long that takes to 'scrub', 'balance', 'check',
'subvolume delete', 'find', etc.

> [ ... ] 4x HW Raid 5 with 1GB controller memory of 4TB 3,5"
> devices and using btrfs as raid 0 for data and metadata on top
> of those 4 raid 5. [ ... ] the write speed is not as good as
> i would like - especially for random 8k-16k I/O. [ ... ]

Also I noticed that the rain is wet and cold - especially if one walks
around for a few hours in a t-shirt, shorts and sandals. :-)

> My current idea is to use a pcie flash card with bcache on top
> of each raid 5. Is this something which makes sense to speed
> up the write speed.

Well, 'bcache' in the role of write buffer allegedly helps turn
unaligned writes into aligned writes, so it might help, but I wonder how
effective that will be in this case; plus it won't turn low
random-IOPS-per-TB 4TB devices into high ones. Anyhow, if it is
battery-backed, the 1GB of HW HBA cache/buffer should do exactly that,
except that, again, in this case that is rather optimistic.

But this reminds me of the common story: "Doctor, if I repeatedly stab
my hand with a fork it hurts a lot, how do I fix that?" "Don't do it". :-)

PS Random writes of 8-16KiB over 60TB might mean storing small
records/images in small files. That would be "brave". On a 60TB RAID50
of 20x 4TB disk drives that might mean around 5-10MB/s of random small
writes, including both data and metadata.
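For what it's worth, the 5-10MB/s figure above can be roughly reconstructed. The numbers below are assumptions for illustration, not measurements: ~100 random IOPS per 7200rpm drive, the classic RAID5 read-modify-write penalty of 4 I/Os per small write, 20 drives, and an average request size of 12KiB (midway between 8k and 16k):

```shell
# Back-of-envelope estimate of random small-write throughput on a
# 20-drive RAID50 (all inputs are assumptions, not measured values).
drives=20
iops_per_drive=100        # typical random IOPS for a 7200rpm SATA drive
raid5_write_penalty=4     # read data + read parity + write data + write parity
avg_write_kib=12          # midpoint of the 8k-16k random I/O range

write_iops=$((drives * iops_per_drive / raid5_write_penalty))
throughput_kib=$((write_iops * avg_write_kib))
echo "~${write_iops} small writes/s, ~$((throughput_kib / 1024))MiB/s"
```

With these inputs that comes out near the lower end of the 5-10MB/s range; a controller cache that merges writes would push it upward.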
Re: speed up big btrfs volumes with ssds
On Sun, Sep 3, 2017 at 8:32 PM, Stefan Priebe - Profihost AG wrote:
> Hello,
>
> I'm trying to speed up big btrfs volumes.
>
> Some facts:
> - Kernel will be 4.13-rc7
> - needed volume size is 60TB
>
> Currently without any ssds I get the best speed with:
> - 4x HW Raid 5 with 1GB controller memory of 4TB 3,5" devices
>
> and using btrfs as raid 0 for data and metadata on top of those 4 raid 5.
>
> I can live with a data loss every now and then ;-) so a raid 0 on
> top of the 4x raid5 is acceptable for me.
>
> Currently the write speed is not as good as I would like - especially
> for random 8k-16k I/O.
>
> My current idea is to use a pcie flash card with bcache on top of each
> raid 5.

Whether it can speed things up depends quite a lot on what the use-case
is; for some not-so-parallel access patterns it might work. So this 60TB
is then 20x 4TB disks or so, and the 4x 1GB cache is simply not very
helpful I think: the working set doesn't fit in it, I guess. If there
are mostly single or only a few users of the fs, a single pcie card
bcaching 4 devices can work, but for SATA SSDs I would use 1 SSD per
HW raid5.

Then roughly make sure the complete set of metadata blocks fits in the
cache. For an fs of this size, let's estimate 150G. Then maybe the same
or double for data, so an SSD of 500G would be a first try.

You give the impression that reliability for this fs is not the highest
prio, so if you go full risk, put bcache in write-back mode; then you
will have your desired random 8k-16k I/O speedup after the cache is
warmed up. But any SW or HW failure will normally result in total fs
loss if SSD and HDD get out of sync somehow. Bcache write-through might
also be acceptable. You will need extensive monitoring and tuning of all
(bcache) parameters etc. to be sure of the right choice of size and
setup.
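As a concrete illustration of the layout discussed above, an untested sketch of the commands involved (device names and the cache-set UUID placeholder are hypothetical examples, not from the thread):

```shell
# One flash cache device shared by four backing devices (the HW raid5
# LUNs), with btrfs raid0 across the resulting bcache devices.
make-bcache -C /dev/nvme0n1              # create the cache device on the flash card

for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    make-bcache -B "$dev"                # register each HW raid5 LUN as a backing device
done

# Attach each backing device to the cache set (UUID from 'bcache-super-show')
# and enable write-back caching - the full-risk mode discussed above.
for n in 0 1 2 3; do
    echo <cset-uuid> > /sys/block/bcache$n/bcache/attach
    echo writeback   > /sys/block/bcache$n/bcache/cache_mode
done

# Finally, btrfs raid0 for data and metadata over the four bcache devices.
mkfs.btrfs -d raid0 -m raid0 /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/bcache3
```

This is only a sketch of the wiring; sizing the cache (the 500G estimate above) and tuning the bcache sysfs knobs would still be needed.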
Re: speed up big btrfs volumes with ssds
>> [ ... ] Currently the write speed is not as good as I would
>> like - especially for random 8k-16k I/O. [ ... ]

> [ ... ] So this 60TB is then 20 4TB disks or so and the 4x 1GB
> cache is simply not very helpful I think. The working set
> doesn't fit in it I guess. If there is mostly single or a few
> users of the fs, a single pcie based bcacheing 4 devices can
> work, but for SATA SSD, I would use 1 SSD per HWraid5. [ ... ]

Probably the idea of the cacheable working set of a random small-write
workload is a major new scientific discovery. :-)
Re: speed up big btrfs volumes with ssds
Am 04.09.2017 um 12:53 schrieb Henk Slager:
> On Sun, Sep 3, 2017 at 8:32 PM, Stefan Priebe - Profihost AG wrote:
>> [ ... ]
>> My current idea is to use a pcie flash card with bcache on top of each
>> raid 5.
>
> If it can speed up depends quite a lot on what the use-case is, for
> some not-so-much-parallel-access it might work. So this 60TB is then
> 20 4TB disks or so and the 4x 1GB cache is simply not very helpful I
> think. The working set doesn't fit in it I guess. If there is mostly
> single or a few users of the fs, a single pcie based bcacheing 4
> devices can work, but for SATA SSD, I would use 1 SSD per HWraid5.

Yes, that's roughly my idea as well, and yes, the workload is 4 users
max writing data, 50% sequential, 50% random.

> Then roughly make sure the complete set of metadata blocks fits in the
> cache. For an fs of this size let's say/estimate 150G. Then maybe same
> of double for data, so an SSD of 500G would be a first try.

I would use 1TB devices for each raid or a 4TB PCIe card.

> You give the impression that reliability for this fs is not the
> highest prio, so if you go full risk, then put bcache in write-back
> mode, then you will have your desired random 8k-16k I/O speedup after
> the cache is warmed up. But any SW or HW failure will result in total
> fs loss normally if SSD and HDD get out of sync somehow. Bcache
> write-through might also be acceptable, you will need extensive
> monitoring and tuning of all (bcache) parameters etc to be sure of the
> right choice of size and setup etc.

Yes, I wanted to use the write-back mode. Has anybody already done some
tests or got experience with a setup like this?

Greets,
Stefan
Re: speed up big btrfs volumes with ssds
2017-09-04 15:57 GMT+03:00 Stefan Priebe - Profihost AG:
> Am 04.09.2017 um 12:53 schrieb Henk Slager:
>> [ ... ] If there is mostly single or a few users of the fs, a single
>> pcie based bcacheing 4 devices can work, but for SATA SSD, I would
>> use 1 SSD per HWraid5. [ ... ]
>
> Yes that's roughly my idea as well and yes the workload is 4 users max
> writing data. 50% sequential, 50% random.
> [ ... ]
> Yes I wanted to use the write-back mode. Has anybody already made some
> tests or got experience with a setup like this?

Maybe you can make your raid setup work faster by:

1. Using the single profile.

2. Using a different stripe size for the HW RAID5: I think 16KiB will
   be optimal with 5 devices per raid group. That will give you a 64KiB
   data stripe and 16KiB parity. Btrfs raid0 uses 64KiB as its stripe
   size, so a different geometry can make data access unaligned (or use
   the single profile for btrfs).

3. Using btrfs ssd_spread to decrease RMW cycles.

Thanks.
--
Have a nice day,
Timofey.
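The geometry in point 2 can be checked with simple arithmetic (numbers from the mail, not measured): a 5-disk RAID5 stripes data over 4 disks, so a 16KiB per-disk chunk yields a 64KiB full data stripe, matching the 64KiB stripe btrfs raid0 uses, so full-stripe writes line up:

```shell
# Verify that a 16KiB chunk on a 5-disk RAID5 produces a 64KiB full
# data stripe (4 data disks + 1 parity disk per stripe).
disks_per_group=5
parity_disks=1
chunk_kib=16

data_disks=$((disks_per_group - parity_disks))
full_stripe_kib=$((data_disks * chunk_kib))
echo "full data stripe: ${full_stripe_kib}KiB"   # → full data stripe: 64KiB
```

A write smaller than, or misaligned against, that 64KiB stripe forces the controller into a read-modify-write cycle, which is what points 2 and 3 try to avoid.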
Re: speed up big btrfs volumes with ssds
On Monday, 4 September 2017 2:57:18 PM AEST Stefan Priebe - Profihost AG wrote:
>> Then roughly make sure the complete set of metadata blocks fits in the
>> cache. For an fs of this size let's say/estimate 150G. Then maybe same
>> of double for data, so an SSD of 500G would be a first try.
>
> I would use 1TB devices for each Raid or a 4TB PCIe card.

One thing I've considered is to create a filesystem with a RAID-1 of
SSDs and then create lots of files with long names to use up a lot of
space on the SSDs. Then delete those files and add disks to the
filesystem. BTRFS should then keep using the allocated metadata chunks
on the SSDs for all metadata, and use the disks for just data.

I haven't yet tried bcache, but would prefer something simpler with one
less layer.

--
My Main Blog       http://etbe.coker.com.au/
My Documents Blog  http://doc.coker.com.au/
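An untested sketch of the trick described above (device names, counts and the mount point are hypothetical; as the author notes, btrfs makes no hard guarantee that metadata stays on the pre-allocated chunks):

```shell
# Pre-allocate metadata chunks on an SSD RAID-1 before the big disks join.
mkfs.btrfs -m raid1 -d raid1 /dev/ssd1 /dev/ssd2
mount /dev/ssd1 /mnt/big

# Burn through metadata space with many empty, long-named files; file
# names live in the metadata trees, so this forces metadata chunk
# allocation on the SSDs.
for i in $(seq 1 2000000); do
    touch "/mnt/big/$(printf 'f%0250d' "$i")"
done
rm -f /mnt/big/f*

# Now grow the fs with the spinning disks; the hope is that existing
# metadata chunks stay on the SSDs while new data chunks land on the disks.
btrfs device add /dev/sda /dev/sdb /dev/sdc /dev/sdd /mnt/big
```

A later balance would still need filters chosen carefully, since an unrestricted balance could relocate the metadata chunks off the SSDs again.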
Re: speed up big btrfs volumes with ssds
>>> [ ... ] Currently without any ssds I get the best speed with:
>>> - 4x HW Raid 5 with 1GB controller memory of 4TB 3,5" devices
>>> and using btrfs as raid 0 for data and metadata on top of
>>> those 4 raid 5. [ ... ] the write speed is not as good as I
>>> would like - especially for random 8k-16k I/O. [ ... ]

> [ ... ] 64kb data stripe and 16kb parity Btrfs raid0 use 64kb
> as stripe so that can make data access unaligned (or use single
> profile for btrfs) 3. Use btrfs ssd_spread to decrease RMW
> cycles.

This is not as "revolutionary" a scientific discovery as the idea of a
working set of a small-size random-write workload, but it still takes a
lot of "optimism" to imagine that it is possible to "decrease RMW
cycles" for "random 8k-16k" writes on 64KiB+16KiB RAID5 stripes,
whether with 'ssd_spread' or not.

To "decrease RMW cycles" does indeed seem to me a better aim than the
"radical" one of caching the working set of a random-small-write
workload, but it may be less easy to achieve than desirable :-).

http://www.baarf.dk/
Re: speed up big btrfs volumes with ssds
Am 04.09.2017 um 15:28 schrieb Timofey Titovets:
> 2017-09-04 15:57 GMT+03:00 Stefan Priebe - Profihost AG:
>> [ ... ]
>> Yes I wanted to use the write-back mode. Has anybody already made
>> some tests or got experience with a setup like this?
>
> Maybe you can make your raid setup work faster by:
> 1. Use Single Profile

I'm already using the raid0 profile - see below:

Data,RAID0: Size:22.57TiB, Used:21.08TiB
Metadata,RAID0: Size:90.00GiB, Used:82.28GiB
System,RAID0: Size:64.00MiB, Used:1.53MiB

> 2. Use different stripe size for HW RAID5:
>    I think 16kb will be optimal with 5 devices per raid group.
>    That will give you 64kb data stripe and 16kb parity.
>    Btrfs raid0 uses 64kb as stripe, so that can make data access
>    unaligned (or use single profile for btrfs).

That sounds like an interesting idea, except for the unaligned writes.
I will need to test this.

> 3. Use btrfs ssd_spread to decrease RMW cycles.

Can you explain this?

Stefan
Re: speed up big btrfs volumes with ssds
2017-09-04 21:32 GMT+03:00 Stefan Priebe - Profihost AG:
>> Maybe you can make your raid setup work faster by:
>> 1. Use Single Profile
>
> I'm already using the raid0 profile - see below:

If I understand correctly, you have a very big data set with random RW
access, so: I'm suggesting the single profile for compact writes to one
device, which can make the write-back cache more effective. Writes will
not be spread over several devices, and as a result there is a higher
chance that a full stripe will be overwritten. That will just work like
raid0 with a very big stripe size.

> Data,RAID0: Size:22.57TiB, Used:21.08TiB
> Metadata,RAID0: Size:90.00GiB, Used:82.28GiB
> System,RAID0: Size:64.00MiB, Used:1.53MiB
>
>> 2. Use different stripe size for HW RAID5:
>>    I think 16kb will be optimal with 5 devices per raid group.
>>    [ ... ]
>
> That sounds like an interesting idea except for the unaligned writes.
> Will need to test this.

Afaik btrfs also uses 64kb for metadata:
https://github.com/torvalds/linux/blob/e26f1bea3b833fb2c16fb5f0a949da1efa219de3/fs/btrfs/extent-tree.c#L6678

>> 3. Use btrfs ssd_spread to decrease RMW cycles.
> Can you explain this?

Long description: https://www.spinics.net/lists/linux-btrfs/msg67515.html
Short: that option changes the allocator logic. The allocator will
spread writes more aggressively and always try to write to a new/empty
area. So in theory it will write new data to a new empty chunk; if you
have a lot of free space, that gives some guarantee of not touching old
data, hence no RMW, and in theory always full-stripe writes. But if you
expect your array to run near full and you don't want to defragment it,
that can easily get you an ENOSPC error.

That's just my IMHO, thanks.

--
Have a nice day,
Timofey.
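Putting both suggestions from this message together, a hedged sketch of what trying them would look like (the mount point is an example; an online balance on a 22TiB filesystem will take a long time and should be tested on non-critical data first):

```shell
# Convert existing raid0 data and metadata to the single profile with an
# online balance, as suggested above.
btrfs balance start -dconvert=single -mconvert=single /mnt/big

# Remount with ssd_spread so the allocator prefers fresh, empty areas,
# which is the RMW-avoidance idea described above.
mount -o remount,ssd_spread /mnt/big

# Verify the new profiles after the conversion finishes.
btrfs filesystem df /mnt/big
```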
Re: speed up big btrfs volumes with ssds
Hello,

Am 04.09.2017 um 20:32 schrieb Stefan Priebe - Profihost AG:
> Am 04.09.2017 um 15:28 schrieb Timofey Titovets:
>> [ ... ]
>> 3. Use btrfs ssd_spread to decrease RMW cycles.
>
> Can you explain this?

I was able to fix this issue with ssd_spread. Could it be that the
default allocators, nossd and ssd, search too hard for free space? Even
space_tree did not help.

Greets,
Stefan