Re: Tiered storage?

2017-11-16 Thread Kai Krakow
On Wed, 15 Nov 2017 08:11:04 +0100, waxhead wrote:

> As for dedupe there is (to my knowledge) nothing fully automatic yet. 
> You have to run a program to scan your filesystem but all the 
> deduplication is done in the kernel.
> duperemove works apparently quite well when I tested it, but there
> may be some performance implications.

There's bees, a near-line deduplication tool: it watches for
generation changes in the filesystem and walks the inodes. It only
looks at extents, not at files. Deduplication itself is then delegated
to the kernel, which ensures all changes are data-safe. The process
runs as a daemon and processes your changes in realtime (delayed by
a few seconds to minutes of course, due to the transaction commit and
hashing phases).

You need to dedicate part of your RAM to it; around 1 GB is
usually sufficient to work well enough. The RAM will be locked and
cannot be swapped out, so you should have a sufficiently equipped
system.

Works very well here (2TB of data, 1GB hash table, 16GB RAM).
Newly duplicated files are picked up within seconds, scanned (hitting
the cache most of the time, thus not requiring physical IO), and then
submitted to the kernel for deduplication.

I'd call that fully automatic: Once set up, it just works, and works
well. Performance impact is very low once the initial scan is done.

https://github.com/Zygo/bees
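
For anyone curious what "submitted to the kernel for deduplication"
means mechanically: tools in this space ultimately drive the kernel's
extent-same / dedupe-range interface, which byte-compares the ranges
before sharing extents. A minimal sketch, not bees' actual code (file
names are made up, error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIDEDUPERANGE (the extent-same ioctl) */

int main(void)
{
    /* Hypothetical paths; both files must already contain identical
     * data in the given range, otherwise the kernel reports "differs". */
    int src = open("/data/a.bin", O_RDONLY);
    int dst = open("/data/b.bin", O_RDWR);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    struct file_dedupe_range *arg =
        calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
    arg->src_offset = 0;
    arg->src_length = 128 * 1024;        /* dedupe the first 128 KiB */
    arg->dest_count = 1;
    arg->info[0].dest_fd = dst;
    arg->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, arg) < 0) { perror("ioctl"); return 1; }

    printf("status=%d bytes_deduped=%llu\n", arg->info[0].status,
           (unsigned long long)arg->info[0].bytes_deduped);
    free(arg);
    return 0;
}

duperemove and bees essentially boil down to finding candidate ranges
by hashing and then issuing calls like the one above; because the
kernel re-verifies the data before sharing, the operation stays
data-safe even if a userspace hash collides.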


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Tiered storage?

2017-11-15 Thread Duncan
Roy Sigurd Karlsbakk posted on Wed, 15 Nov 2017 15:10:08 +0100 as
excerpted:

>>> As for dedupe there is (to my knowledge) nothing fully automatic yet.
>>> You have to run a program to scan your filesystem but all the
>>> deduplication is done in the kernel.
>>> duperemove works apparently quite well when I tested it, but there may
>>> be some performance implications.

>> Correct, there is nothing automatic (and there are pretty significant
>> arguments against doing automatic deduplication in most cases), but the
>> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
>> Duperemove in particular does a good job, though it may take a long
>> time for large data sets.
>> 
>> As far as performance, it's no worse than large numbers of snapshots.
>> The issues arise from using very large numbers of reflinks.
> 
> What is this "large" number of snapshots? Not that it's directly
> comparible, but I've worked with ZFS a while, and haven't seen those
> issues there.

Btrfs has scaling issues with reflinks, not so much in normal operation, 
but when it comes to filesystem maintenance such as btrfs check and btrfs 
balance.

Numerically, low double-digits of reflinks per extent seem to be 
reasonably fine, high double-digits to low triple-digits begin to run 
into scaling issues, and at high triple-digits to over 1000... better be 
prepared to wait a while (can be days or weeks!) for that balance or check 
to complete, and check requires LOTS more memory as well, particularly at 
TB+ scale.

Of course snapshots are the common instance of reflinking, and each 
snapshot is another reflink to each extent of the data in the subvolume 
it covers, so limiting snapshots to 10-50 per subvolume is 
recommended, and limiting to under 250-ish is STRONGLY recommended.  
(The total number of snapshots per filesystem, where there are many 
subvolumes and the snapshots per subvolume fall within the above limits, 
doesn't seem to be a problem.)

Dedupe uses reflinking too, but the effects can be much more variable 
depending on the use-case and how many actual reflinks are being created.
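
A side note for anyone to whom "reflink" is fuzzy: a reflink is simply a 
shared extent.  At whole-file granularity you can create one explicitly 
with the clone ioctl, which is essentially what cp --reflink does; a 
snapshot does the same thing for every extent in a subvolume, and dedupe 
produces the same kind of sharing after the fact.  A minimal sketch with 
made-up paths:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE (the clone/reflink ioctl) */

int main(void)
{
    int src = open("/data/original", O_RDONLY);
    int dst = open("/data/reflinked-copy", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* Share all of src's extents with dst.  No data is copied; blocks
     * are only duplicated later if one side is modified (copy-on-write). */
    if (ioctl(dst, FICLONE, src) < 0) { perror("FICLONE"); return 1; }
    return 0;
}

Roughly speaking, every shared extent carries back-references that check 
and balance have to walk, which is where the scaling pain described above 
comes from.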

A single extent with 1000 deduping reflinks, as might be common in a 
commercial/hosting use-case, shouldn't be too bad, perhaps comparable to 
a single snapshot, but obviously, do that with a bunch of extents (as a 
hosting use-case might) and it quickly builds to the effect of 1000 
snapshots of the same subvolume, which as mentioned above puts 
maintenance-task time out of the realm of reasonable, for many.

Tho of course in a commercial/hosting case maintenance may well not be 
done at all, as a simple swap-in of a fresh backup is more likely, so it 
may not matter for that scenario.

OTOH, a typical individual/personal use-case may dedup many files but 
only single-digit times each, so the effect would be the same as a single-
digit number of snapshots at worst.

Meanwhile, while btrfs quotas are finally maturing in terms of actually 
tracking the numbers correctly, their effect on scaling is pretty bad 
too.  The recommendation is to keep btrfs quotas off unless you actually 
need them.  If you do need quotas, temporarily disable them while doing 
balances and device-removes (which do implicit balances), then quota-
rescan after the balance is done, because precisely tracking quotas thru 
a balance means recalculating the numbers again and again during the 
balance, and that just doesn't scale.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Tiered storage?

2017-11-15 Thread Roy Sigurd Karlsbakk
>> As for dedupe there is (to my knowledge) nothing fully automatic yet.
>> You have to run a program to scan your filesystem but all the
>> deduplication is done in the kernel.
>> duperemove works apparently quite well when I tested it, but there may
>> be some performance implications.
> Correct, there is nothing automatic (and there are pretty significant
> arguments against doing automatic deduplication in most cases), but the
> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
> Duperemove in particular does a good job, though it may take a long time
> for large data sets.
> 
> As far as performance, it's no worse than large numbers of snapshots.
> The issues arise from using very large numbers of reflinks.

What is this "large" number of snapshots? Not that it's directly comparable,
but I've worked with ZFS a while, and haven't seen those issues there.

Kind regards,

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Carve the good in stone, write the bad in snow.


Re: Tiered storage?

2017-11-15 Thread Austin S. Hemmelgarn

On 2017-11-15 02:11, waxhead wrote:
> As a regular BTRFS user I can tell you that there is no such thing as
> hot data tracking yet. Some people seem to use bcache together with
> btrfs and come asking for help on the mailing list.
Bcache works fine these days.  It was only with older versions that there 
were issues.  dm-cache similarly works fine on recent versions.  In both 
cases though, you need to be sure you know what you're doing, otherwise 
you are liable to break things.


> Raid5/6 have received a few fixes recently, and it *may* soon be worth
> trying out raid5/6 for data, but keeping metadata in raid1/10 (I would
> rather lose a file or two than the entire filesystem).
>
> I had plans to run some tests on this a while ago, but forgot about it.
> As all good citizens do, remember to have good backups. Last time I
> tested raid5/6, I ran into issues easily. For what it's worth, raid1/10
> seems pretty rock solid as long as you have sufficient disks (hint: you
> need more than two for raid1 if you want to stay safe)
Parity profiles (raid5 and raid6) still have issues, although there are 
fewer than there were, with most of the remaining issues surrounding 
recovery.  I would still recommend against it for production usage.


Simple replication (raid1) is pretty much rock solid as long as you keep 
on top of replacing failing hardware and aren't stupid enough to run the 
array degraded for any extended period of time (converting to a single 
device volume instead of leaving things with half a volume is vastly 
preferred for multiple reasons).


Striped replication (raid10) is generally fine, but you can get much 
better performance by running BTRFS with a raid1 profile on top of two 
MD/LVM/Hardware RAID0 volumes (BTRFS still doesn't do a very good job of 
parallelizing things).


> As for dedupe there is (to my knowledge) nothing fully automatic yet.
> You have to run a program to scan your filesystem but all the
> deduplication is done in the kernel.
> duperemove works apparently quite well when I tested it, but there may
> be some performance implications.
Correct, there is nothing automatic (and there are pretty significant 
arguments against doing automatic deduplication in most cases), but the 
off-line options (via the EXTENT_SAME ioctl) are reasonably reliable. 
Duperemove in particular does a good job, though it may take a long time 
for large data sets.


As far as performance, it's no worse than large numbers of snapshots. 
The issues arise from using very large numbers of reflinks.


> Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I've been following this project on and off for quite a few years, and
>> I wonder if anyone has looked into tiered storage on it. With tiered
>> storage, I mean hot data lying on fast storage and cold data on slow
>> storage. I'm not talking about caching (where you just keep a copy of
>> the hot data on the fast storage).
>>
>> And btw, how far is raid[56] and block-level dedup from something
>> useful in production?
>>
>> Kind regards,
>>
>> roy
>> --
>> Roy Sigurd Karlsbakk
>> (+47) 98013356
>> http://blogg.karlsbakk.net/
>> GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
>> --
>> Carve the good in stone, write the bad in snow.




Re: Tiered storage?

2017-11-15 Thread Austin S. Hemmelgarn

On 2017-11-15 04:26, Marat Khalili wrote:


> On 15/11/17 10:11, waxhead wrote:
>> hint: you need more than two for raid1 if you want to stay safe
> Huh? Two is not enough? Having three or more makes a difference? (Or,
> you mean hot spare?)
They're probably referring to an issue where a two-device array 
configured for raid1 which had lost a device and was mounted degraded 
and writable would generate single-profile chunks on the remaining 
device instead of a half-complete raid1 chunk.  Since older kernels 
only checked the filesystem as a whole for normal/degraded/irreparable 
status instead of checking individual chunks, they would then refuse to 
mount the resultant filesystem, which meant that you only had one 
chance to fix such an array.


If instead you have more than two devices, regular complete raid1 
profile chunks are generated, and it becomes a non-issue.


The second issue has been fixed in the most recent kernels, which now 
check degraded status at the chunk level instead of the volume level.


The first issue has not been fixed yet, but I'm pretty sure there are 
patches pending.



Re: Tiered storage?

2017-11-15 Thread Marat Khalili


On 15/11/17 10:11, waxhead wrote:

> hint: you need more than two for raid1 if you want to stay safe
Huh? Two is not enough? Having three or more makes a difference? (Or, 
you mean hot spare?)


--

With Best Regards,
Marat Khalili


Re: Tiered storage?

2017-11-14 Thread waxhead
As a regular BTRFS user I can tell you that there is no such thing as 
hot data tracking yet. Some people seem to use bcache together with 
btrfs and come asking for help on the mailing list.


Raid5/6 have received a few fixes recently, and it *may* soon be worth 
trying out raid5/6 for data, but keeping metadata in raid1/10 (I would 
rather lose a file or two than the entire filesystem).

I had plans to run some tests on this a while ago, but forgot about it.
As all good citizens do, remember to have good backups. Last time I 
tested raid5/6, I ran into issues easily. For what it's worth, raid1/10 
seems pretty rock solid as long as you have sufficient disks (hint: you 
need more than two for raid1 if you want to stay safe)


As for dedupe there is (to my knowledge) nothing fully automatic yet. 
You have to run a program to scan your filesystem but all the 
deduplication is done in the kernel.
duperemove works apparently quite well when I tested it, but there may 
be some performance implications.


Roy Sigurd Karlsbakk wrote:

> Hi all
>
> I've been following this project on and off for quite a few years, and I wonder
> if anyone has looked into tiered storage on it. With tiered storage, I mean hot
> data lying on fast storage and cold data on slow storage. I'm not talking about
> caching (where you just keep a copy of the hot data on the fast storage).
>
> And btw, how far is raid[56] and block-level dedup from something useful in
> production?
>
> Kind regards,
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 98013356
> http://blogg.karlsbakk.net/
> GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
> --
> Carve the good in stone, write the bad in snow.


Tiered storage?

2017-11-14 Thread Roy Sigurd Karlsbakk
Hi all

I've been following this project on and off for quite a few years, and I wonder 
if anyone has looked into tiered storage on it. With tiered storage, I mean hot 
data lying on fast storage and cold data on slow storage. I'm not talking about 
caching (where you just keep a copy of the hot data on the fast storage).

And btw, how far is raid[56] and block-level dedup from something useful in 
production?

Kind regards,

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Carve the good in stone, write the bad in snow.


btrfs RAID1 woes and tiered storage

2011-01-21 Thread Hubert Kario
I've been experimenting lately with the btrfs RAID1 implementation and have to 
say that it is performing quite well, but there are a few problems:

* when I purposefully damage partitions on which btrfs stores data (for 
  example, by changing the case of letters) it will read the other copy and 
  return correct data. It doesn't report this fact in dmesg every time, but 
  it does correct the copy with the wrong checksum
* when both copies are damaged it returns the damaged block as it is 
  written(!) and only adds a warning in dmesg with the exact same wording as 
  with single-block corruption(!!)
* from what I could find, btrfs doesn't remember anywhere the number of 
  detected and fixed corruptions

I don't know if this is the final design, and while the first and last points 
are minor inconveniences, the second one is quite major: as it stands, it lets 
silent corruption go unnoticed by applications. I think that reading from such 
blocks should return EIO (unless mounted with nodatasum), or at least trigger a 
broadcast message noting that a corrupted block is being returned to userspace.



I've also been thinking about tiered storage (meaning 2+, not only two-tiered) 
and have some ideas about it.

I think that there need to be 3 different mechanisms working together to 
achieve high performance:
* ability to store all metadata on selected volumes (probably read optimised 
  SSDs)
* ability to store all newly written data on selected volumes (write optimised 
  SSDs)
* ability to differentiate between often written, often read and infrequently 
  accessed data (and based on this information, ability to move this data to 
  fast SSDs, slow SSDs, fast RAID, slow RAID or MAID)

While the first two are rather straightforward, the third one needs some 
explanation. I think that for this to work, we should save not only the time 
of last access to a file and the last change time, but also a few past values 
(I think that at least 8 to 16 ctimes and atimes are necessary, but this will 
need testing). I'm not sure about how and exactly when to move this data 
around to keep the arrays balanced, but a userspace daemon would be most 
flexible.
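
To make that third mechanism a little more concrete, here is a purely 
illustrative sketch (nothing btrfs defines; the constants and the retention 
count are invented for the example) of how a userspace daemon could collapse 
a handful of retained access times into a single "heat" score for ranking 
files across tiers:

#include <math.h>
#include <time.h>

#define N_SAMPLES 16    /* e.g. the 8-16 retained access times */

/* Exponentially decayed heat: recent accesses count for more, old ones
 * fade out.  Unfilled slots are left as 0 and skipped. */
double file_heat(const time_t atimes[N_SAMPLES], time_t now)
{
    const double half_life = 7.0 * 24 * 3600;   /* one week, arbitrary */
    const double ln2 = 0.6931471805599453;
    double heat = 0.0;

    for (int i = 0; i < N_SAMPLES; i++) {
        if (atimes[i] == 0)
            continue;
        double age = difftime(now, atimes[i]);
        heat += exp(-age * ln2 / half_life);
    }
    return heat;    /* higher = hotter = belongs on a faster tier */
}

Such a daemon could then periodically sort files by this kind of score and 
migrate the hottest and coldest ends of the ranking up or down, rate-limiting 
itself so the migration traffic doesn't swamp the slower tier.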

This solution won't work well for file systems with a few very large files of 
which only a few parts change often; in other words, it won't be doing block-
level tiered storage. From what I know, databases would benefit most from such 
a configuration, but then most databases can already partition tables to 
different files based on access rate. As such, keeping the granularity at file 
level would make this mechanism easy to implement while still being useful.

On second thought, it won't be exactly file-level granular: if we 
introduce snapshots into the mix, the new version can have its data regularly 
accessed while the old snapshot won't, and this way the now-obsolete blocks 
can be moved to slow storage.
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl