Hi!

I think a solution like this, as part of the filesystem, could do much 
better than something outside of it (like bcache). But I'm not sure: what 
makes data hot? I think the biggest benefit lies in detecting random read 
access and marking only that data as hot; writes should also go to the SSD 
first and then be spooled to the hard disks in the background. Bcache 
already does a lot of this.
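
To illustrate what I mean by weighting random reads, here is a rough 
sketch of a hotness heuristic; it is purely hypothetical and not the 
formula the hot-tracking code actually uses:

/*
 * Hypothetical hotness heuristic, only to illustrate weighting random
 * reads more heavily than sequential ones; this is NOT the formula the
 * hot-tracking patches actually use.
 */
#include <stdint.h>

struct hot_stats {
	uint64_t last_end;      /* file offset where the previous read ended */
	uint64_t temperature;   /* arbitrary hotness score */
};

static void record_read(struct hot_stats *s, uint64_t offset, uint64_t len)
{
	/* A read that does not continue where the last one ended costs a
	 * seek on rotating media, so count it as much "hotter". */
	s->temperature += (offset == s->last_end) ? 1 : 8;
	s->last_end = offset + len;
}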

Since this is within the filesystem, users could even mark files as always 
"hot" with some attribute or ioctl. A boot-readahead and preload 
implementation could use this to automatically mark the files used during 
booting as hot, or to do the same for preloading when I start an 
application.
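
As a rough illustration, such an interface could look like the sketch 
below from userspace; note that BTRFS_IOC_PIN_HOT and its request number 
are made up by me, nothing like it exists in the patchset or in mainline 
btrfs:

/*
 * Purely hypothetical sketch: BTRFS_IOC_PIN_HOT and its request number
 * are invented to show what a per-file "always hot" ioctl could look
 * like from userspace; no such ioctl exists anywhere today.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

#define BTRFS_IOCTL_MAGIC 0x94                               /* real btrfs magic */
#define BTRFS_IOC_PIN_HOT _IOW(BTRFS_IOCTL_MAGIC, 250, int)  /* invented */

int main(int argc, char **argv)
{
	int pin = 1;   /* 1 = keep permanently hot, 0 = back to normal */
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, BTRFS_IOC_PIN_HOT, &pin) < 0) {
		perror("pin hot");
		return 1;
	}
	close(fd);
	return 0;
}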

On the other hand, hot relocation should reduce writes to the SSD as much 
as possible. For example: do not defragment files living on the SSD during 
autodefrag (it makes no sense there), and write data in bursts of 
erase-block size, etc.
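
For the erase-block point, I mean batching along these lines; the 2 MiB 
erase block size is just an assumption, real devices vary and the value 
would have to be queried or configured:

/*
 * Minimal sketch of erase-block-sized write batching; the 2 MiB
 * ERASE_BLOCK_SIZE is an assumption, real devices differ and the
 * filesystem would have to query or be told the value.
 */
#include <stddef.h>
#include <stdio.h>

#define ERASE_BLOCK_SIZE (2UL * 1024 * 1024)

/* Return how many of the buffered bytes can be flushed to the SSD right
 * now without creating a partially written erase block. */
static size_t flushable_bytes(size_t pending)
{
	return pending - (pending % ERASE_BLOCK_SIZE);
}

int main(void)
{
	size_t pending = 5UL * 1024 * 1024;   /* 5 MiB of buffered hot data */

	/* Flush 4 MiB now, keep 1 MiB buffered until a full block fills. */
	printf("flush %zu bytes, keep %zu buffered\n",
	       flushable_bytes(pending), pending - flushable_bytes(pending));
	return 0;
}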

Also important: what if the SSD dies due to wear? Will it gracefully fall 
back to the hard disk? And what does "relocation" mean exactly? Hot data 
should only be cached on the SSD as a copy, not moved there. It should be 
possible for btrfs to simply drop a failing SSD from the filesystem 
without data loss; otherwise one would have to use two SSDs in RAID-1 to 
get safe cache storage.

Altogether I think a spinning-media btrfs RAID can outperform a single 
SSD, so hot relocation should probably focus on reducing head movement, 
because that is where an SSD really excels. Everything that involves heavy 
head movement should go to the SSD first and be written back to the hard 
disks later. And I think there is a lot of potential to optimize here, 
because a COW filesystem like btrfs naturally causes a lot of head 
movement.

What do you think?

BTW: I have not tried either one yet because I'm still deciding which way 
to go. Your patches are more appealing to me because I would not need to 
migrate my storage to bcache-provided block devices. OTOH, the bcache 
implementation looks a lot more mature (with regard to performance and 
safety) at this point, because it already provides many of the features 
mentioned above - most importantly, graceful handling of failing SSDs.

Regarding "btrfs RAID outperforms SSD": during boot, my spinning-media 
3-device btrfs RAID reads boot files at up to 600 MB/s (from an 
LZ-compressed fs), and boot takes about 7 seconds until the display 
manager starts (which then takes another 30 seconds, but that's another 
story), even though the system is pretty crowded with services I wouldn't 
actually need if I optimized for boot performance. But I think systemd's 
readahead implementation has a lot of influence on this fast booting: it 
defragments and relocates boot files on btrfs during boot so that the hard 
disks can read all of this sequentially. I think it also compresses boot 
files if compression is enabled, because booting is I/O bound, not CPU 
bound. Benchmarks showed that my btrfs RAID can technically read up to 
450 MB/s, so I think the 600 MB/s figure refers to decompressed data. A 
single SSD could not do that. For the same reason I created a small script 
to defragment and compress the files used by the preload daemon. Without 
benchmarking it, this felt like another small performance boost. So I'm 
eager to see what comes next with some sort of SSD cache, because the only 
problem left seems to be heavy head movement slowing down the system.
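
For reference, my script essentially does per file what the small C 
sketch below does via btrfs' defrag ioctl. I'm assuming the definitions 
from linux/btrfs.h here (on older systems they live in btrfs-progs' 
ioctl.h); "btrfs filesystem defragment -c <file>" does the same from the 
shell.

/*
 * Rough C equivalent of my defrag-and-compress script, using the
 * BTRFS_IOC_DEFRAG_RANGE ioctl. Definitions are assumed to come from
 * linux/btrfs.h; on older systems they live in btrfs-progs' ioctl.h.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_defrag_range_args args;
	int i;

	memset(&args, 0, sizeof(args));
	args.len = (__u64)-1;                      /* whole file */
	args.flags = BTRFS_DEFRAG_RANGE_COMPRESS | /* recompress extents */
		     BTRFS_DEFRAG_RANGE_START_IO;  /* start writeback now */
	args.compress_type = 1;                    /* 1 = zlib */

	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDWR);

		if (fd < 0 || ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args) < 0)
			perror(argv[i]);
		if (fd >= 0)
			close(fd);
	}
	return 0;
}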

Zhi Yong Wu <zwu.ker...@gmail.com> wrote:

> Hi,
> 
>    What do you think - does its design approach go in the right
> direction? Do you have any comments or a better design idea for BTRFS
> hot relocation support? Any comments are appreciated, thanks.
> 
> 
> On Mon, May 6, 2013 at 4:53 PM,  <zwu.ker...@gmail.com> wrote:
>> From: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>>
>>   This patchset is sent out as an RFC mainly to see whether it is
>> going in the correct development direction.
>>
>>   The patchset tries to introduce hot relocation support
>> for BTRFS. In a hybrid storage environment, when data on the
>> HDD gets hot, it can be relocated to the SSD by BTRFS hot
>> relocation support automatically; also, if the SSD usage ratio
>> exceeds its upper threshold, the data which has become cold is
>> looked up and relocated to the HDD to make more space on the
>> SSD first, and then the data which has become hot is relocated
>> to the SSD automatically.
>>
>>   BTRFS hot relocation mainly reserves block space on the SSD
>> first, loads the hot data into the page cache from the HDD,
>> allocates block space on the SSD, and finally writes the data
>> to the SSD.
>>
>>   If you'd like to play with it, please pull the patchset from
>> my git on github:
>>   https://github.com/wuzhy/kernel.git hot_reloc
>>
>> For how to use it, please refer to the example below:
>>
>> root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
>> ^^^ The above command hacks /dev/vdc to be treated as an SSD
>> root@debian-i386:~# echo 999999 > /proc/sys/fs/hot-age-interval
>> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
>> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
>> root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f
>>
>> WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
>> WARNING! - see http://btrfs.wiki.kernel.org before using
>>
>> [ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
>> [ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
>> [ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
>> [ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
>> [ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
>> adding device /dev/vdc id 2
>> [ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
>> fs created label (null) on /dev/vdb
>> nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
>> Btrfs v0.20-rc1-254-gb0136aa-dirty
>> root@debian-i386:~# mount -o hot_move /dev/vdb /data2
>> [ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
>> [ 144.870444] btrfs: disk space caching is enabled
>> [ 144.904214] VFS: Turning on hot data tracking
>> root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
>> 2048+0 records in
>> 2048+0 records out
>> 2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
>> root@debian-i386:~# df -h
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vda1 16G 13G 2.2G 86% /
>> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
>> udev 10M 176K 9.9M 2% /dev
>> tmpfs 4.8G 0 4.8G 0% /dev/shm
>> /dev/vdb 15G 2.0G 13G 14% /data2
>> root@debian-i386:~# btrfs fi df /data2
>> Data: total=3.01GB, used=2.00GB
>> System: total=4.00MB, used=4.00KB
>> Metadata: total=8.00MB, used=2.19MB
>> Data_SSD: total=8.00MB, used=0.00
>> root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
>> ^^^ The above command will start hot relocation, because the data
>> temperature is currently 109
>> root@debian-i386:~# df -h
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vda1 16G 13G 2.2G 86% /
>> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
>> udev 10M 176K 9.9M 2% /dev
>> tmpfs 4.8G 0 4.8G 0% /dev/shm
>> /dev/vdb 15G 2.1G 13G 14% /data2
>> root@debian-i386:~# btrfs fi df /data2
>> Data: total=3.01GB, used=6.25MB
>> System: total=4.00MB, used=4.00KB
>> Metadata: total=8.00MB, used=2.26MB
>> Data_SSD: total=2.01GB, used=2.00GB
>> root@debian-i386:~#
>>
>> Zhi Yong Wu (5):
>>   vfs: add one list_head field
>>   btrfs: add one new block group
>>   btrfs: add one hot relocation kthread
>>   procfs: add three proc interfaces
>>   btrfs: add hot relocation support
>>
>>  fs/btrfs/Makefile            |   3 +-
>>  fs/btrfs/ctree.h             |  26 +-
>>  fs/btrfs/extent-tree.c       | 107 +++++-
>>  fs/btrfs/extent_io.c         |  31 +-
>>  fs/btrfs/extent_io.h         |   4 +
>>  fs/btrfs/file.c              |  36 +-
>>  fs/btrfs/hot_relocate.c      | 802 +++++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/hot_relocate.h      |  48 +++
>>  fs/btrfs/inode-map.c         |  13 +-
>>  fs/btrfs/inode.c             |  92 ++++-
>>  fs/btrfs/ioctl.c             |  23 +-
>>  fs/btrfs/relocation.c        |  14 +-
>>  fs/btrfs/super.c             |  30 +-
>>  fs/btrfs/volumes.c           |  28 +-
>>  fs/hot_tracking.c            |   1 +
>>  include/linux/btrfs.h        |   4 +
>>  include/linux/hot_tracking.h |   1 +
>>  kernel/sysctl.c              |  22 ++
>>  18 files changed, 1234 insertions(+), 51 deletions(-)
>>  create mode 100644 fs/btrfs/hot_relocate.c
>>  create mode 100644 fs/btrfs/hot_relocate.h
>>
>> --
>> 1.7.11.7
>>
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
