Re: Any suggestions for thousands of disk image snapshots ?

2016-07-26 Thread Austin S. Hemmelgarn

On 2016-07-26 10:42, Chris Murphy wrote:

> On Tue, Jul 26, 2016 at 3:37 AM, Kurt Seo  wrote:
>> 2016-07-26 5:49 GMT+09:00 Chris Murphy :
>>> On Mon, Jul 25, 2016 at 1:25 AM, Kurt Seo  wrote:
>>>>  Hi all
>>>>
>>>>  I am currently running a project to build servers with btrfs. The
>>>> purpose of the servers is to export disk images through iSCSI
>>>> targets, and the disk images are generated from btrfs subvolume
>>>> snapshots.
>>>
>>> How is the disk image generated from a Btrfs subvolume snapshot?
>>>
>>> On what file system is the disk image stored?




>> When I create the empty original disk image on btrfs, I do it like this:
>>
>> btrfs sub create /mnt/test/test_disk
>> chattr -R +C /mnt/test/test_disk
>> fallocate -l 50G /mnt/test/test_disk/master.img
>>
>> Then I do the fdisk steps to partition the image.
>> The file system inside the disk image is NTFS; all clients are Windows.
>>
>> I create snapshots from the original subvolume with 'btrfs sub snap'
>> when clients boot up.
>> The reason I store the disk image in a subvolume is that snapshotting
>> a subvolume is faster than 'cp --reflink', and I needed to disable CoW,
>> which made 'cp --reflink' unavailable anyway.


> I don't know what it is, but there's something almost pathological
> about NTFS on Btrfs (via either a raw image or qcow2). It's neurotic
> levels of fragmentation.
It's Windows' write patterns on NTFS that are the issue; the same
problem shows up using LVM thinp snapshots, you just can't see it as
readily because LVM hides more from the user than BTRFS does.  NTFS by
itself runs fine in this situation (I've actually tested this with
Linux), and is no worse than most other filesystems in that respect.
FWIW, it's not quite as bad with current builds of Windows 10, and it's
also a bit better if Windows thinks you're on non-rotational media.


> While an individual image is nocow, it becomes cow due to all the
> snapshots you're creating, so the fragmentation is going to be really
> bad. And then upon snapshot deletion all of those reference counts
> have to be individually accounted for: a thousand snapshots times
> thousands of new extents. I suspect it's the cleanup accounting that's
> really killing the performance.
>
> And of course nocow also means nodatasum, so there's no checksumming
> for these images.


>> Thanks for your answer. Actually I have been trying almost every way
>> for this project.
>> An LVM thin pool is one of them. I tried ZFS on Linux, too. As you
>> mentioned, when metadata is full, the entire LVM pool becomes
>> unrepairable. So I increased the size of the thin pool's metadata LV
>> to 1 percent of the pool, and that problem went away.


> Good to know.


>> Anyway, if LVM is a better option than btrfs for my purpose, what about ZFS?


> ZFS supports block devices presented via iSCSI, so there's no need for
> an image file at all, and it's more mature. But there is no nocow
> option, and I suspect there's going to be as much fragmentation as
> with Btrfs, but maybe not.
Strictly speaking, ZFS exposes parts of the storage pool as block
devices (zvols), which may then be exported however you want.  I know a
couple of people who use it with ATAoE instead of iSCSI, and it works
just as well with NBD too.
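
A minimal sketch of that approach, assuming a pool named 'tank' and
'client42' as an illustrative client name (the clone, not an image
file, is what gets exported):

zfs create -s -V 50G -o volblocksize=64K tank/master   # sparse zvol, appears as /dev/zvol/tank/master
zfs snapshot tank/master@boot                          # point-in-time snapshot of the zvol
zfs clone tank/master@boot tank/client42               # writable per-client clone to export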




>> So you're saying I need to reconsider using btrfs and look at other
>> options like an LVM thin pool. I think that makes sense.
>> I have two more questions.
>>
>> 1. If I move from btrfs to LVM, what about the mdadm chunk size?
>> I am still not sure what the best chunk size is for numerous cloned disks.
>> And are there any recommended options for LVM thin?


> You'd have to benchmark it. mdadm defaults to 512KiB, which works well
> for some use cases but not others. And the LVM chunk size (for
> snapshots) defaults to 64KiB, which works well for some use cases but
> not others. There are lots of levers here.
Ideally, if you're using LVM snapshots on top of MD-RAID, you should
match chunk sizes.  In this particular case, I'd probably start with
256k chunks on both and see how that does, and then adjust from there.
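
A sketch of what matching them looks like, assuming /dev/md0 already
sits in a VG named vg0; all device names and sizes are illustrative:

mdadm --create /dev/md0 --level=10 --chunk=256 --raid-devices=12 /dev/sd[a-l]   # 256KiB MD chunk
lvcreate --type thin-pool -L 1T --chunksize 256k -n pool0 vg0                   # matching pool chunk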


> I just thought of something, though, which is that thin LV snapshots
> can't have their size limited. If you start with a 100GiB LV, each
> snapshot is 100GiB. So any wayward process in any, or all, of these
> thousands of snapshots could bring down the entire storage stack by
> consuming too much of the pool at once. So it's not exactly true that
> each LV is completely isolated from the others.
The same is true of any thinly provisioned storage stack, though.  It's
an inherent risk in that configuration.
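
The usual mitigations are watching pool usage and letting dmeventd grow
the pool automatically; a sketch, reusing the hypothetical vg0/pool0
names from above:

lvs -o lv_name,data_percent,metadata_percent vg0   # watch data and metadata fill levels
# /etc/lvm/lvm.conf, activation section: grow the pool by 20% at 80% full
#   thin_pool_autoextend_threshold = 80
#   thin_pool_autoextend_percent = 20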





>> 2. What about ZFS on Linux? I think ZoL is similar to LVM in some ways.


> I haven't used it for anything like this use case, but it's a
> full-blown file system, which LVM is not. Sometimes simpler is better.
> All you really need here is a logical block device that you can
> snapshot; the actual file system of concern is NTFS, which can of
> course exist directly on an LV - no disk image needed. Using LVM,
> other than NTFS fragmentation itself, you have no additional
> fragmentation of any underlying file system since there isn't one. And
> LVM snapshot deletions should be pretty fast.

Re: Any suggestions for thousands of disk image snapshots ?

2016-07-26 Thread Chris Murphy
On Tue, Jul 26, 2016 at 3:37 AM, Kurt Seo  wrote:
> 2016-07-26 5:49 GMT+09:00 Chris Murphy :
>> On Mon, Jul 25, 2016 at 1:25 AM, Kurt Seo  wrote:
>>>  Hi all
>>>
>>>  I am currently running a project to build servers with btrfs. The
>>> purpose of the servers is to export disk images through iSCSI
>>> targets, and the disk images are generated from btrfs subvolume
>>> snapshots.
>>
>> How is the disk image generated from a Btrfs subvolume snapshot?
>>
>> On what file system is the disk image stored?
>>
>>
>
> When I create the empty original disk image on btrfs, I do it like this:
>
> btrfs sub create /mnt/test/test_disk
> chattr -R +C /mnt/test/test_disk
> fallocate -l 50G /mnt/test/test_disk/master.img
>
> Then I do the fdisk steps to partition the image.
> The file system inside the disk image is NTFS; all clients are Windows.
>
> I create snapshots from the original subvolume with 'btrfs sub snap'
> when clients boot up.
> The reason I store the disk image in a subvolume is that snapshotting
> a subvolume is faster than 'cp --reflink', and I needed to disable CoW,
> which made 'cp --reflink' unavailable anyway.

I don't know what it is, but there's something almost pathological
about NTFS on Btrfs (via either a raw image or qcow2). It's neurotic
levels of fragmentation.

While an individual image is nocow, it becomes cow due to all the
snapshots you're creating, so the fragmentation is going to be really
bad. And then upon snapshot deletion all of those reference counts
have to be individually accounted for: a thousand snapshots times
thousands of new extents. I suspect it's the cleanup accounting that's
really killing the performance.

And of course nocow also means nodatasum, so there's no checksumming
for these images.




>  Thanks for your answer. Actually I have been trying almost every way
> for this project.
> An LVM thin pool is one of them. I tried ZFS on Linux, too. As you
> mentioned, when metadata is full, the entire LVM pool becomes
> unrepairable. So I increased the size of the thin pool's metadata LV
> to 1 percent of the pool, and that problem went away.

Good to know.
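
For what it's worth, the thin pool's metadata LV can also be grown
after creation; a one-line sketch with the hypothetical vg0/pool0 names
from earlier:

lvextend --poolmetadatasize +1G vg0/pool0   # grow thin pool metadata online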

> Anyway, if LVM is a better option than btrfs for my purpose, what about ZFS?

ZFS supports block devices presented via iSCSI, so there's no need for
an image file at all, and it's more mature. But there is no nocow
option, and I suspect there's going to be as much fragmentation as
with Btrfs, but maybe not.


>  So you're saying I need to reconsider using btrfs and look at other
> options like an LVM thin pool. I think that makes sense.
> I have two more questions.
>
> 1. If I move from btrfs to LVM, what about the mdadm chunk size?
> I am still not sure what the best chunk size is for numerous cloned disks.
> And are there any recommended options for LVM thin?

You'd have to benchmark it. mdadm defaults to 512KiB, which works well
for some use cases but not others. And the LVM chunk size (for
snapshots) defaults to 64KiB, which works well for some use cases but
not others. There are lots of levers here.

I just thought of something, though, which is that thin LV snapshots
can't have their size limited. If you start with a 100GiB LV, each
snapshot is 100GiB. So any wayward process in any, or all, of these
thousands of snapshots could bring down the entire storage stack by
consuming too much of the pool at once. So it's not exactly true that
each LV is completely isolated from the others.


>
> 2. What about ZFS on Linux? I think ZoL is similar to LVM in some ways.

I haven't used it for anything like this use case, but it's a
full-blown file system, which LVM is not. Sometimes simpler is better.
All you really need here is a logical block device that you can
snapshot; the actual file system of concern is NTFS, which can of
course exist directly on an LV - no disk image needed. Using LVM, other
than NTFS fragmentation itself, you have no additional fragmentation of
any underlying file system since there isn't one. And LVM snapshot
deletions should be pretty fast.


-- 
Chris Murphy


Re: Any suggestions for thousands of disk image snapshots ?

2016-07-26 Thread Kurt Seo
2016-07-26 5:49 GMT+09:00 Chris Murphy :
> On Mon, Jul 25, 2016 at 1:25 AM, Kurt Seo  wrote:
>>  Hi all
>>
>>  I am currently running a project to build servers with btrfs. The
>> purpose of the servers is to export disk images through iSCSI
>> targets, and the disk images are generated from btrfs subvolume
>> snapshots.
>
> How is the disk image generated from a Btrfs subvolume snapshot?
>
> On what file system is the disk image stored?
>
>

When I create the empty original disk image on btrfs, I do it like this:

btrfs sub create /mnt/test/test_disk              # dedicated subvolume for the image
chattr -R +C /mnt/test/test_disk                  # disable CoW while the subvolume is still empty
fallocate -l 50G /mnt/test/test_disk/master.img   # preallocate the raw image file

Then I do the fdisk steps to partition the image.
The file system inside the disk image is NTFS; all clients are Windows.

I create snapshots from the original subvolume with 'btrfs sub snap'
when clients boot up.
The reason I store the disk image in a subvolume is that snapshotting
a subvolume is faster than 'cp --reflink', and I needed to disable CoW,
which made 'cp --reflink' unavailable anyway.
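
Concretely, the per-client flow is something like this, with 'client42'
standing in for a real client ID:

btrfs sub snap /mnt/test/test_disk /mnt/test/client42   # instant snapshot at client boot
# /mnt/test/client42/master.img is exported as that client's iSCSI backing store
btrfs sub delete /mnt/test/client42                     # removed again at client shutdown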

>> The maximum number of clients is 500, and each client uses two
>> snapshots of disk images. The first disk image's size is about 50GB
>> and the second one is about 1.5TB.
>> An important thing is that the original 1.5TB disk image is mounted
>> with a loop device and modified in real time - e.g. continuously
>> downloading torrents into it.
>> Snapshots are made when clients boot up and deleted when they are turned off.
>>
>> So the server has two original disk images and about a thousand
>> snapshots in total.
>> I made a list of factors that affect the server's performance and stability.
>>
>> 1. Raid Configuration - Mdadm raid vs btrfs raid, configuration and
>> options for them.
>> 2. How to format btrfs - nodesize, features
>> 3. Mount options - nodatacow and compression things.
>> 4. Kernel parameter tuning.
>> 5. Hardware specification.
>>
>>
>> My current setups are
>>
>> 1. mdadm raid10 with 1024k chunks and 12 x 512GB SSDs.
>> 2. nodesize 32k and nothing else.
>> 3. nodatacow, noatime, nodiratime, nospace_cache, ssd, compress=lzo
>> 4. Ubuntu with 4.1.27 kernel without additional configurations.
>> 5.
>> CPU : Xeon E3-1225 v2, quad core, 3.2GHz
>> RAM : 2 x 8GB DDR3 ECC (16GB total)
>> NIC : 2 x 10GbE
>>
>>
>>  The test results so far are:
>>
>> 1. btrfs-transaction and btrfs-cleaner consume CPU regularly.
>> 2. When the CPU is busy with those processes, creating snapshots
>> takes a long time.
>> 3. Performance degrades over time.
>>
>>
>> So if any of my configuration is wrong or missing something, can you
>> suggest changes? For example, do I need to increase physical memory?
>>
>> Any idea would help me a lot.
>
> Offhand it sounds like you have a file system inside a disk image
> which is itself stored on a file system. So there are two file systems.
> And somehow you have to create the disk image from a subvolume, which
> isn't going to be very fast. And also something I read recently on the
> XFS list makes me wonder if loop devices are production worthy.
>
> I'd reconsider the layout for any one of these reasons alone.
>
>
> 1. mdadm raid10 + LVM thinp + either XFS or Btrfs. The first LV you
> create is the one the host is constantly updating. You can use XFS
> freeze to freeze the file system, take the snapshot, and then release
> the freeze. You now have the original LV which is still being updated
> by the host, but you have a 2nd LV that itself can be exported as an
> iSCSI target to a client system. There's no need to create a disk
> image, so the creation of the snapshot and iSCSI target is much
> faster.
>



> 2. Similar to the above, but you could make the 2nd LV (the snapshot)
> a Btrfs seed device that all of the clients share, and they are each
> pointed to their own additional LV used for the Btrfs sprout device.
>
> The issue I had a year ago with LVM thin provisioning is that when the
> metadata pool gets full, the entire VG implodes very badly, and I
> didn't get sufficient warning in advance that the setup was
> suboptimal, or that it was about to run out of metadata space, and it
> wasn't repairable. But it was just a test. I haven't substantially
> played with LVM thinp with more than a dozen snapshots. But LVM being
> like emacs, you've got a lot of levers to adjust things depending on
> the workload, whereas Btrfs has very few.
>
> Therefore, the plus of the 2nd option is you're only using a handful
> of LVM thinp snapshots. And you're also not really using Btrfs
> snapshots either; you're using the union-like fs feature of the
> seed-sprout capability of Btrfs. The other nice thing is that when the
> clients quit, the LVs are removed at an LVM extent level. There is no
> need for file system cleanup processes to decrement reference counts
> on all the affected extents. So it'd be fast.
>

>
> --
> Chris Murphy


 Thanks for your answer. Actually I have been trying almost every way
for this project.
An LVM thin pool is one of them. I tried ZFS on Linux, too. As you
mentioned, when metadata is full, the entire LVM pool becomes
unrepairable. So I increased the size of the thin pool's metadata LV
to 1 percent of the pool, and that problem went away.

Anyway, if LVM is a better option than btrfs for my purpose, what about ZFS?

So you're saying I need to reconsider using btrfs and look at other
options like an LVM thin pool. I think that makes sense.
I have two more questions.

1. If I move from btrfs to LVM, what about the mdadm chunk size?
I am still not sure what the best chunk size is for numerous cloned disks.
And are there any recommended options for LVM thin?

2. What about ZFS on Linux? I think ZoL is similar to LVM in some ways.

Re: Any suggestions for thousands of disk image snapshots ?

2016-07-25 Thread Chris Murphy
On Mon, Jul 25, 2016 at 2:49 PM, Chris Murphy  wrote:

> 1. mdadm raid10 + LVM thinp + either XFS or Btrfs.

You probably want to use LVM thin provisioning because its snapshot
implementation is much faster and more hands-off than thick
provisioning.

And you'll need to use mdadm raid10 instead of LVM raid10 because
thinp doesn't support RAID.
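
A sketch of that stacking order, with all device and LV names
hypothetical:

mdadm --create /dev/md0 --level=10 --raid-devices=12 /dev/sd[a-l]     # RAID at the bottom
pvcreate /dev/md0
vgcreate vg0 /dev/md0                                                 # LVM on top of MD
lvcreate --type thin-pool -L 1T --poolmetadatasize 10G -n pool0 vg0   # thin pool in the VG
lvcreate -V 1.5T --thinpool pool0 -n master vg0                       # thin LV to hold the data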



-- 
Chris Murphy


Re: Any suggestions for thousands of disk image snapshots ?

2016-07-25 Thread Chris Murphy
On Mon, Jul 25, 2016 at 1:25 AM, Kurt Seo  wrote:
>  Hi all
>
>
>  I am currently running a project to build servers with btrfs. The
> purpose of the servers is to export disk images through iSCSI
> targets, and the disk images are generated from btrfs subvolume
> snapshots.

How is the disk image generated from a Btrfs subvolume snapshot?

On what file system is the disk image stored?


> The maximum number of clients is 500, and each client uses two
> snapshots of disk images. The first disk image's size is about 50GB
> and the second one is about 1.5TB.
> An important thing is that the original 1.5TB disk image is mounted
> with a loop device and modified in real time - e.g. continuously
> downloading torrents into it.
> Snapshots are made when clients boot up and deleted when they are turned off.
>
> So the server has two original disk images and about a thousand
> snapshots in total.
> I made a list of factors that affect the server's performance and stability.
>
> 1. Raid Configuration - Mdadm raid vs btrfs raid, configuration and
> options for them.
> 2. How to format btrfs - nodesize, features
> 3. Mount options - nodatacow and compression things.
> 4. Kernel parameter tuning.
> 5. Hardware specification.
>
>
> My current setups are
>
> 1. mdadm raid10 with 1024k chunks and 12 x 512GB SSDs.
> 2. nodesize 32k and nothing else.
> 3. nodatacow, noatime, nodiratime, nospace_cache, ssd, compress=lzo
> 4. Ubuntu with 4.1.27 kernel without additional configurations.
> 5.
> CPU : Xeon E3-1225 v2, quad core, 3.2GHz
> RAM : 2 x 8GB DDR3 ECC (16GB total)
> NIC : 2 x 10GbE
>
>
>  The test results so far are:
>
> 1. btrfs-transaction and btrfs-cleaner consume CPU regularly.
> 2. When the CPU is busy with those processes, creating snapshots
> takes a long time.
> 3. Performance degrades over time.
>
>
> So if any of my configuration is wrong or missing something, can you
> suggest changes? For example, do I need to increase physical memory?
>
> Any idea would help me a lot.

Offhand it sounds like you have a file system inside a disk image
which is itself stored on a file system. So there are two file systems.
And somehow you have to create the disk image from a subvolume, which
isn't going to be very fast. And also something I read recently on the
XFS list makes me wonder if loop devices are production worthy.

I'd reconsider the layout for any one of these reasons alone.


1. mdadm raid10 + LVM thinp + either XFS or Btrfs. The first LV you
create is the one the host is constantly updating. You can use XFS
freeze to freeze the file system, take the snapshot, and then release
the freeze. You now have the original LV which is still being updated
by the host, but you have a 2nd LV that itself can be exported as an
iSCSI target to a client system. There's no need to create a disk
image, so the creation of the snapshot and iSCSI target is much
faster.
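
A minimal sketch of that flow, reusing the hypothetical vg0/master
names from the earlier sketch (with the file system on the master LV
mounted at /mnt/master, and 'client42' as an illustrative target name):

xfs_freeze -f /mnt/master            # quiesce the file system on vg0/master
lvcreate -s -n client42 vg0/master   # thin snapshot: instant, no size argument needed
xfs_freeze -u /mnt/master            # resume writes
lvchange -ay -K vg0/client42         # thin snapshots skip auto-activation by default
# /dev/vg0/client42 is now a stable block device to export as an iSCSI target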

2. Similar to the above, but you could make the 2nd LV (the snapshot)
a Btrfs seed device that all of the clients share, and they are each
pointed to their own additional LV used for the Btrfs sprout device.
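
A sketch of the seed/sprout mechanics that implies, with hypothetical
LV names (the shared LV carries a Btrfs file system here, not a raw
NTFS image):

btrfstune -S 1 /dev/vg0/seed                       # mark the shared LV as a seed device
mount /dev/vg0/seed /mnt/client42                  # a seed device mounts read-only
btrfs device add /dev/vg0/sprout42 /mnt/client42   # per-client writable sprout device
mount -o remount,rw /mnt/client42                  # new writes land on the sprout only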

The issue I had a year ago with LVM thin provisioning is that when the
metadata pool gets full, the entire VG implodes very badly, and I
didn't get sufficient warning in advance that the setup was
suboptimal, or that it was about to run out of metadata space, and it
wasn't repairable. But it was just a test. I haven't substantially
played with LVM thinp with more than a dozen snapshots. But LVM being
like emacs, you've got a lot of levers to adjust things depending on
the workload, whereas Btrfs has very few.

Therefore, the plus of the 2nd option is you're only using a handful
of LVM thinp snapshots. And you're also not really using Btrfs
snapshots either; you're using the union-like fs feature of the
seed-sprout capability of Btrfs. The other nice thing is that when the
clients quit, the LVs are removed at an LVM extent level. There is no
need for file system cleanup processes to decrement reference counts
on all the affected extents. So it'd be fast.


-- 
Chris Murphy


Any suggestions for thousands of disk image snapshots ?

2016-07-25 Thread Kurt Seo
 Hi all


 I am currently running a project to build servers with btrfs.
The purpose of the servers is to export disk images through iSCSI
targets, and the disk images are generated from btrfs subvolume
snapshots.
The maximum number of clients is 500, and each client uses two
snapshots of disk images. The first disk image's size is about 50GB
and the second one is about 1.5TB.
An important thing is that the original 1.5TB disk image is mounted
with a loop device and modified in real time - e.g. continuously
downloading torrents into it.
Snapshots are made when clients boot up and deleted when they are turned off.
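
For reference, the loop mount looks something like this (the loop
device and partition number are illustrative):

losetup -fP --show /mnt/test/test_disk/master.img   # -P scans partitions; prints e.g. /dev/loop0
mount /dev/loop0p1 /mnt/master                      # mount the first partition for writing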

So the server has two original disk images and about a thousand
snapshots in total.
I made a list of factors that affect the server's performance and stability.

1. Raid Configuration - Mdadm raid vs btrfs raid, configuration and
options for them.
2. How to format btrfs - nodesize, features
3. Mount options - nodatacow and compression things.
4. Kernel parameter tuning.
5. Hardware specification.


My current setups are

1. mdadm raid10 with 1024k chunks and 12 x 512GB SSDs.
2. nodesize 32k and nothing else.
3. nodatacow, noatime, nodiratime, nospace_cache, ssd, compress=lzo
4. Ubuntu with 4.1.27 kernel without additional configurations.
5.
CPU : Xeon E3-1225 v2, quad core, 3.2GHz
RAM : 2 x 8GB DDR3 ECC (16GB total)
NIC : 2 x 10GbE


 The test results so far are:

1. btrfs-transaction and btrfs-cleaner consume CPU regularly.
2. When the CPU is busy with those processes, creating snapshots takes
a long time.
3. Performance degrades over time.


So if any of my configuration is wrong or missing something, can you
suggest changes? For example, do I need to increase physical memory?

Any idea would help me a lot.

Thank you


Seo