Hi,
I like your idea and implementation for offline deduplication a lot. I
think it will save me 50% of my backup storage!
Your code walks/scans the directory/file tree of the filesystem. Would
it be possible to walk/scan the disk extents sequentially in disk
order?
- This would be more I/O-efficient
On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason wrote:
> Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
> > On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
> > >
> > > I think that dedup has a variety of use cases that are all very dependent
> > > on your work
Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
> On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
> >
> > I think that dedup has a variety of use cases that are all very dependent
> > on your workload. The approach you have here seems to be a quite
> > reasonable one.
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
>
> I think that dedup has a variety of use cases that are all very dependent
> on your workload. The approach you have here seems to be a quite
> reasonable one.
>
> I did not see it in the code, but it is great to be able to collect
I think that dedup has a variety of use cases that are all very dependent on
your workload. The approach you have here seems to be a quite reasonable one.
I did not see it in the code, but it is great to be able to collect statistics
on how effective your hash is and any counters for the extr
On Thursday, January 06, 2011 01:35:15 pm Chris Mason wrote:
> What is the smallest granularity that the datadomain searches for in
> terms of dedup?
>
> Josef's current setup isn't restricted to a specific block size, but
> there is a min match of 4k.
I talked to a few people I know and didn't ge
Excerpts from Peter A's message of 2011-01-05 22:58:36 -0500:
> On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
> > > I'd just make it always use the fs block size. No point in making it
> > > variable.
> >
> > Agreed. What is the reason for variable block size?
>
> First post on this l
Just a quick update, I've dropped the hashing stuff in favor of doing a memcmp
in the kernel to make sure the data is still the same. The thing that takes a
while is reading the data up from disk, so doing a memcmp of the entire buffer
isn't that big of a deal, not to mention there's a possibility
On Thursday 06 of January 2011 10:51:04 Mike Hommey wrote:
> On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
> > >I have been thinking a lot about de-duplication for a backup application
> > >I am writing. I wrote a little script to figure out how much it would
> > >save me. For
On Thursday, January 06, 2011 10:07:03 am you wrote:
> I'd be interested to see the evidence of the "variable length" argument.
> I have a sneaky suspicion that it actually falls back to 512 byte
> blocks, which are much more likely to align, when more sensibly sized
> blocks fail. The downside is
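For reference, the "variable length" approach under suspicion here usually means content-defined chunking: boundaries are chosen by a rolling hash of the data itself, so chunk edges follow the content rather than absolute offsets, and an insertion early in a stream shifts later boundaries instead of destroying them. A toy buzhash sketch with made-up parameters — nothing from Josef's patches:

```python
import random

random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]  # fixed random byte table

def _rotl(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) & 0xFFFFFFFF) | (x >> (32 - n))

def chunk_boundaries(data, mask=0x3FF, window=32, min_size=512):
    """Content-defined chunking with a buzhash rolling hash: declare a
    cut wherever the low bits of the window hash are zero. Toy sketch;
    real implementations tune window, mask, and min/max chunk sizes."""
    bounds = []
    h = 0
    start = 0
    for i, byte in enumerate(data):
        h = _rotl(h, 1) ^ TABLE[byte]
        if i >= window:
            # cancel the contribution of the byte leaving the window
            h ^= _rotl(TABLE[data[i - window]], window % 32)
        if i - start + 1 >= min_size and (h & mask) == 0:
            bounds.append(i + 1)
            start = i + 1
    if start < len(data):
        bounds.append(len(data))
    return bounds
```

Every hash-triggered chunk is at least `min_size` bytes, which is the role the 4k minimum match plays in the fixed-block scheme.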
On Thu, Jan 06, 2011 at 02:41:28PM +0000, Gordan Bobic wrote:
> Ondřej Bílka wrote:
>
> >>>Then again, for a lot of use-cases there are perhaps better ways to
> >>>achieve the targeted goal than deduping on FS level, e.g. snapshotting or
> >>>something like fl-cow:
> >>>http://www.xmailserver.org/fl
Peter A wrote:
On Thursday, January 06, 2011 09:00:47 am you wrote:
Peter A wrote:
I'm saying in a filesystem it doesn't matter - if you bundle everything
into a backup stream, it does. Think of tar. 512 byte alignment. I tar
up a directory with 8TB total size. No big deal. Now I create a new,
On Thursday, January 06, 2011 09:00:47 am you wrote:
> Peter A wrote:
> > I'm saying in a filesystem it doesn't matter - if you bundle everything
> > into a backup stream, it does. Think of tar. 512 byte alignment. I tar
> > up a directory with 8TB total size. No big deal. Now I create a new,
> >
Tomasz Torcz wrote:
On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote:
CPU can handle considerably more than 250 block hashings per
second. You could argue that this changes in cases of sequential
I/O on big files, but a 1.86 GHz Core2 can churn through
111MB/s of SHA256, which even SSDs
Ondřej Bílka wrote:
Then again, for a lot of use-cases there are perhaps better ways to
achieve the targeted goal than deduping on FS level, e.g. snapshotting or
something like fl-cow:
http://www.xmailserver.org/flcow.html
As far as VMs are concerned, fl-cow is a poor replacement for deduping.
Depends on
On Thu, Jan 06, 2011 at 12:18:34PM +, Simon Farnsworth wrote:
> Gordan Bobic wrote:
>
> > Josef Bacik wrote:
> >
snip
>
> > Then again, for a lot of use-cases there are perhaps better ways to
> > achieve the targeted goal than deduping on FS level, e.g. snapshotting or
> > something like fl-c
On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote:
> >CPU can handle considerably more than 250 block hashings per
> >second. You could argue that this changes in cases of sequential
> >I/O on big files, but a 1.86 GHz Core2 can churn through
> >111MB/s of SHA256, which even SSDs will strug
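The 111 MB/s figure is easy to reproduce for one's own CPU with a quick single-threaded measurement; a hashlib sketch (the amount hashed is arbitrary):

```python
import hashlib
import time

def sha256_throughput(mib=64, block=1 << 20):
    """Measure single-threaded SHA-256 throughput in MiB/s by hashing
    `mib` MiB of zeros in 1 MiB updates. Returns (rate, hex digest)."""
    buf = b"\x00" * block
    h = hashlib.sha256()
    start = time.perf_counter()
    for _ in range(mib):
        h.update(buf)
    elapsed = time.perf_counter() - start
    return mib / elapsed, h.hexdigest()
```

Numbers vary widely per CPU (and with hardware SHA extensions), which is exactly the point being argued about here.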
Peter A wrote:
On Thursday, January 06, 2011 05:48:18 am you wrote:
Can you elaborate what you're talking about here? How does the length of
a directory name affect alignment of file block contents? I don't see
how variability of length matters, other than to make things a lot more
complicated.
On Thursday, January 06, 2011 05:48:18 am you wrote:
> Can you elaborate what you're talking about here? How does the length of
> a directory name affect alignment of file block contents? I don't see
> how variability of length matters, other than to make things a lot more
> complicated.
I'm saying
Gordan Bobic wrote:
> Simon Farnsworth wrote:
>
>> The basic idea is to use fanotify/inotify (whichever of the notification
>> systems works for this) to track which inodes have been written to. It
>> can then mmap() the changed data (before it's been dropped from RAM) and
>> do the same process
Simon Farnsworth wrote:
The basic idea is to use fanotify/inotify (whichever of the notification
systems works for this) to track which inodes have been written to. It can
then mmap() the changed data (before it's been dropped from RAM) and do the
same process as an offline dedupe (hash, check
Gordan Bobic wrote:
> Josef Bacik wrote:
>
>> Basically I think online dedup is a huge waste of time and completely
>> useless.
>
> I couldn't disagree more. First, let's consider what is the
> general-purpose use-case of data deduplication. What are the resource
> requirements to perform it? How
Tomasz Chmielewski wrote:
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chu
Peter A wrote:
On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
I'd just make it always use the fs block size. No point in making it
variable.
Agreed. What is the reason for variable block size?
First post on this list - I mostly was just reading so far to learn more on fs
design but
Spelic wrote:
On 01/06/2011 02:03 AM, Gordan Bobic wrote:
That's just alarmist. AES is being cryptanalyzed because everything
uses it. And the news of its insecurity is somewhat exaggerated (for
now at least).
Who cares... the fact of not being much used is a benefit for RIPEMD /
blowfi
On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
> >I have been thinking a lot about de-duplication for a backup application
> >I am writing. I wrote a little script to figure out how much it would
> >save me. For my laptop home directory, about 100 GiB of data, it was a
> >coupl
Chris Mason wrote:
Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500:
Josef Bacik wrote:
Basically I think online dedup is a huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. Wha
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I w
On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik wrote:
> Here are patches to do offline deduplication for Btrfs. It works well for the
> cases it's expected to, I'm looking for feedback on the ioctl interface and
> such, I'm well aware there are missing features for the use
On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
> > I'd just make it always use the fs block size. No point in making it
> > variable.
>
> Agreed. What is the reason for variable block size?
First post on this list - I mostly was just reading so far to learn more on fs
design but this
On Wed, Jan 5, 2011 at 5:03 PM, Gordan Bobic wrote:
> On 01/06/2011 12:22 AM, Spelic wrote:
> Definitely agree that it should be a per-directory option, rather than per
> mount.
JOOC, would the dedupe "table" be done per directory, per mount, per
sub-volume, or per volume? The larger the pool of
On 01/06/2011 02:03 AM, Gordan Bobic wrote:
That's just alarmist. AES is being cryptanalyzed because everything
uses it. And the news of its insecurity is somewhat exaggerated (for
now at least).
Who cares... the fact of not being much used is a benefit for RIPEMD /
blowfish-twofish then
Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500:
> Josef Bacik wrote:
>
> > Basically I think online dedup is a huge waste of time and completely useless.
>
> I couldn't disagree more. First, let's consider what is the
> general-purpose use-case of data deduplication. What are th
On 01/05/2011 09:46 PM, Gordan Bobic wrote:
On 01/05/2011 07:46 PM, Josef Bacik wrote:
Offline dedup is more expensive - so why are you of the opinion that
it is less silly? And comparison by silliness quotient still sounds
like an argument over which is better.
If I can say my opinion, I
On 01/06/2011 12:22 AM, Spelic wrote:
On 01/05/2011 09:46 PM, Gordan Bobic wrote:
On 01/05/2011 07:46 PM, Josef Bacik wrote:
Offline dedup is more expensive - so why are you of the opinion that
it is less silly? And comparison by silliness quotient still sounds
like an argument over which is be
On 01/05/2011 09:14 PM, Diego Calleja wrote:
In fact, there are cases where online dedup is clearly much worse. For
example, cases where people suffer duplication, but it takes a lot of
time (several months) to hit it. With online dedup, you need to enable
it all the time to get deduplication, an
On ke, 2011-01-05 at 19:58 +, Lars Wirzenius wrote:
> (For my script, see find-duplicate-chunks in
> http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the
> current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/";.
> http://braawi.org/obnam/ is the home page of the b
On 01/05/2011 07:46 PM, Josef Bacik wrote:
Blah blah blah, I'm not having an argument about which is better because I
simply do not care. I think dedup is silly to begin with, and online dedup even
sillier.
Offline dedup is more expensive - so why are you of the opinion that it
is less silly
On Wed, Jan 5, 2011 at 12:15 PM, Josef Bacik wrote:
> Yeah for things where you are talking about sending it over the network or
> something like that every little bit helps. I think deduplication is far more
> interesting and useful at an application level than at a filesystem level.
> For
>
On Wed, Jan 05, 2011 at 11:01:41AM -0800, Ray Van Dolson wrote:
> On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
> > On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:
> > > So by doing the hash indexing offline, the total amount of disk I/O
> > > required effectively
On 01/05/2011 07:01 PM, Ray Van Dolson wrote:
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU s
On 01/05/2011 06:41 PM, Diego Calleja wrote:
On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.
But there are p
On Wed, Jan 05, 2011 at 07:58:13PM +, Lars Wirzenius wrote:
> On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
> > Blah blah blah, I'm not having an argument about which is better because I
> > simply do not care. I think dedup is silly to begin with, and online dedup
> > even
> > sillier
On Wed, Jan 5, 2011 at 11:46 AM, Josef Bacik wrote:
> Dedup is only useful if you _know_ you are going to have duplicate
> information,
> so the two major usecases that come to mind are
>
> 1) Mail server. You have small files, probably less than 4k (blocksize) that
> you are storing hundreds t
On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
> Blah blah blah, I'm not having an argument about which is better because I
> simply do not care. I think dedup is silly to begin with, and online dedup
> even
> sillier. The only reason I did offline dedup was because I was just toying
> aro
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
> On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió:
> > So by doing the hash indexing offline, the total amount of disk I/O
> > required effectively doubles, and the amount of CPU spent on doing the
> > hashing is in no
On Wed, Jan 05, 2011 at 05:42:42PM +, Gordan Bobic wrote:
> Josef Bacik wrote:
>
>> Basically I think online dedup is a huge waste of time and completely useless.
>
> I couldn't disagree more. First, let's consider what is the
> general-purpose use-case of data deduplication. What are the resou
Josef Bacik wrote:
Basically I think online dedup is a huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
requirements to perform it? How do these resource requirements dif
Here are patches to do offline deduplication for Btrfs. It works well for the
cases it's expected to, I'm looking for feedback on the ioctl interface and
such, I'm well aware there are missing features for the userspace app (like
being able to set a different blocksize). If th