Thank you for the information and clarifications. This helps me understand the operation somewhat better. I will keep looking into the subject.
Regardless of this, I will change the structure of the data in my use
case and switch to rsync --inplace --no-whole-file.

2017-01-04 13:58 GMT+01:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
> On 2017-01-03 16:35, Peter Becker wrote:
>>
>> As I understand the duperemove source code (I have been working on
>> improving it for 5 or 6 weeks now, in several areas), duperemove does
>> all hashing and calculation before it calls extent_same. Duperemove
>> stores everything in a hashfile and reads from that; after all files
>> are hashed and the duplicates detected, it processes them in order
>> without reading new data from disk or the hashfile. So the
>> byte-by-byte comparison in the extent_same ioctl should consume the
>> full possible bandwidth of the disks.
>
> Not necessarily. You've actually got a significant amount of
> processing between each disk operation. The general ordering inside
> the ioctl is:
> 1. Do generic ioctl setup.
> 2. Lock the extents.
> 3. Read the ranges into memory.
> 4. Compare the ranges.
> 5. If the ranges are identical, write out the changes needed to
>    reflink them.
> 6. Unlock all the extents.
> 7. Do generic ioctl cleanup.
> 1 and 7 in particular are pretty heavy. Ioctls were not intended to
> be called with this kind of frequency, and that fact really shows in
> the setup and teardown (the overhead is way higher than for a
> syscall). The operation ended up being an ioctl instead of a syscall
> (or an extension to another syscall) because:
> 1. Manipulating low-level filesystem state is part of what ioctls are
>    intended to be used for.
> 2. Introducing a new FS-specific ioctl is a whole lot less
>    controversial than introducing a new FS-specific syscall.
>>
>> 1. dbfile_load_hashes
>> 2. find_all_dupes
>> 3. dedupe_results
>>    -> calls the following in N threads:
>>       dedupe_extent_list
>>         list_for_each_entry
>>           add_extent_to_dedupe   # produces a simple list/queue
>>           dedupe_extents
>>             btrfs_extent_same
>>               BTRFS_IOC_FILE_EXTENT_SAME
>>
>> So if this is right, one of these things is really slow:
>>
>> 1. the byte-by-byte comparison
>
> There's no way that this part can't be slow. You need to load the
> data into registers to do the comparison; you can't just point
> something at RAM and get an answer. On x86 this means that the
> comparison amounts to a loop of 2 loads followed by a compare and a
> branch, repeated once for each range beyond the first, and that's
> assuming the compiler optimizes it to the greatest degree possible.
> On some other systems the compare and branch are one instruction, on
> others the second load might be eliminated, but overall it's not
> something that can be sped up all that much.
>>
>> 2. setting up the reflinks
>
> This actually is not as efficient as it sounds like it should be.
> Adding reflinks means updating metadata, which means there is some
> unavoidable overhead here. I doubt that this is where the issue is,
> but I may be wrong.
>>
>> 3. unlocking the new extent
>
> There's one other aspect not listed here: locking the original
> extents, which can actually add quite a lot of overhead if the files
> are actually being used.
>>
>> If I'm not wrong in my understanding of the duperemove source code,
>> this behaviour should also affect the online dedupe feature Qu Wenruo
>> is working on.
>
> AFAIK, that uses a different code path from the batch deduplication
> ioctl. It also doesn't have the context switches and other overhead
> of an ioctl involved, because it's done in kernel code.
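For anyone following along, this is roughly what a single call into
that ioctl looks like from user space. It is a minimal, untested
sketch using the VFS-level FIDEDUPERANGE interface (which newer
kernels provide as the generic form of BTRFS_IOC_FILE_EXTENT_SAME),
with the file names and the 1 MiB length taken from the test sample
quoted further down; this is not duperemove's actual code:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>     /* FIDEDUPERANGE, struct file_dedupe_range */

int main(void)
{
        int src = open("/mnt/old/file", O_RDONLY);
        int dst = open("/mnt/new/file", O_RDWR);
        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }

        /* One source range, one destination. For this single request
         * the kernel runs the whole setup/lock/read/compare/reflink/
         * unlock/cleanup sequence described above. */
        struct file_dedupe_range *arg =
                calloc(1, sizeof(*arg) +
                          sizeof(struct file_dedupe_range_info));
        arg->src_offset = 0;
        arg->src_length = 1024 * 1024;          /* 1 MiB extent */
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst;
        arg->info[0].dest_offset = 0;

        if (ioctl(src, FIDEDUPERANGE, arg) < 0) {
                perror("FIDEDUPERANGE");
                return 1;
        }

        if (arg->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
                printf("ranges differ, nothing deduped\n");
        else if (arg->info[0].status < 0)
                printf("dedupe failed: %s\n",
                       strerror(-arg->info[0].status));
        else
                printf("deduped %llu bytes\n",
                       (unsigned long long)arg->info[0].bytes_deduped);

        free(arg);
        close(src);
        close(dst);
        return 0;
}

Note that dest_count can be larger than 1, so several duplicate copies
of the same range can be handled in one call, which amortizes the
setup and teardown cost mentioned above.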
>
>>
>> 2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
>>> On 2017-01-03 15:20, Peter Becker wrote:
>>>>
>>>> I think I understand. The resulting key question is how I can
>>>> improve the performance of the extent_same ioctl.
>>>> I tested it with the following results:
>>>>
>>>> Environment:
>>>> 2 files (each called "file", 100 GB each), duperemove with the
>>>> nofiemap option set, 1 MB extent size.
>>>>
>>>> duperemove output:
>>>> [0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
>>>> [0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
>>>> [0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
>>>> [0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G,
>>>> 1.0M), "/mnt/new/file"
>>>>
>>>> iostat output for a 30 sec. sample:
>>>>
>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>           22,31   0,00    13,83    33,81    0,00  30,05
>>>>
>>>> Device:  rrqm/s  wrqm/s      r/s   w/s    rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>>> sdd        0,00    1,70  1149,93  0,73  4600,53  139,60      8,24      0,23   0,20     0,19    13,64   0,19  21,84
>>>> sde        0,00    0,00  1149,33  0,53  4597,33   23,87      8,04      0,20   0,18     0,18     1,75   0,18  20,47
>>>> sdf        0,00    1,70  1149,60  0,63  4598,40  139,60      8,24      0,21   0,18     0,18     4,63   0,18  20,63
>>>> sdh        0,00    0,00  1149,33  0,53  4597,33   23,87      8,04      0,21   0,18     0,18     4,25   0,18  20,85
>>>>
>>>> resulting in less than 18 MB/s read. Really slow.
>>>>
>>>> Question 1: why so slow?
>>>
>>> For a couple of reasons. First, you have to understand that
>>> duperemove itself actually does a pretty large amount of processing
>>> outside of the call to the ioctl. It first hashes the blocks for
>>> quicker comparison and matching, then figures out which blocks
>>> match, and finally calls the ioctl on the resulting matches. The
>>> reason for this behavior is that the ioctl is insanely slow. It
>>> first locks the ranges passed in (so they don't get changed by
>>> anything else during the deduplication process), then does a
>>> byte-by-byte comparison to make sure they all actually do match
>>> (data safety; I've said at least once before that I think there
>>> should be a flag for the ioctl, or a separate ioctl, to skip this
>>> and assume that userspace really knows what it's doing), then
>>> finally sets up the reflinks and unlocks the new extent.
>>>
>>> All of this ties into why I keep telling people that efficient
>>> deduplication requires a tool that understands how the data being
>>> deduplicated is structured. By avoiding the need to hash and
>>> compare every block of data, you can significantly improve the time
>>> that part takes, and quite often this will mitigate the impact of
>>> getting a few false positives passed into the ioctl.
>>>>
>>>> Question 2a: would a larger extent size perform better?
>>>> Question 2b: or did I misunderstand something?
>>>
>>> No, a larger extent would probably not help much, and that's
>>> actually a really good performance sample you've created.
>>>
>>> The block size does end up being somewhat of a trade-off. Ideally,
>>> you want it matched to the smallest possible chunk of duplicate
>>> data greater than or equal to the filesystem block size for maximal
>>> space efficiency.
>>> Doing this, however, makes the extra processing done by duperemove
>>> take exponentially longer, because it has to calculate hashes for
>>> more blocks (this has very low impact until you get to very small
>>> block sizes) and has to make exponentially more comparisons (this
>>> has a very big impact as you shrink the block size; just halving
>>> the block size will roughly quadruple the time it takes to make the
>>> comparisons).
>>>>
>>>> 2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
>>>>> On 2017-01-03 14:21, Peter Becker wrote:
>>>>>>
>>>>>> All of those objections are justified, but not relevant in
>>>>>> (offline) backup and archive scenarios.
>>>>>>
>>>>>> For example, you have multiple versions of append-only log files
>>>>>> or append-only DB files (each more than 100 GB in size), like
>>>>>> this:
>>>>>>
>>>>>>> Snapshot_01_01_2017
>>>>>>    -> file1.log .. 201 GB
>>>>>>> Snapshot_02_01_2017
>>>>>>    -> file1.log .. 205 GB
>>>>>>> Snapshot_03_01_2017
>>>>>>    -> file1.log .. 221 GB
>>>>>>
>>>>>> The first 201 GB would be the same every time.
>>>>>> The files are copied at night from Windows, Linux or BSD systems
>>>>>> and snapshotted after the copy.
>>>>>>
>>>>>> So a fast way to dedupe this is needed. Using 128 KB blocks
>>>>>> would result in 1,646,592 extents per snapshot; a 1 MB block
>>>>>> size results in 205,824 extents (not bad, but still terrible
>>>>>> speed).
>>>>>> I will test it tonight with a patched version of duperemove and
>>>>>> a 100 MB block size, but I have no hope that the throughput will
>>>>>> increase.
>>>>>
>>>>> Deduplication is not a general-purpose thing (usually you have
>>>>> very specifically structured data), but duperemove is supposed to
>>>>> be a general-purpose tool. It works fine for two of the most
>>>>> common cases (deduplicating large numbers of small files or small
>>>>> numbers of large files), but it is sub-optimal for those cases,
>>>>> and will be for almost any other case. This is a canonical
>>>>> example, though, of a case where you can use a custom script or
>>>>> program to figure out what's duplicated and then have that just
>>>>> call the ioctl as appropriate itself. Most cases where you are
>>>>> storing some kind of well-structured data fall into this
>>>>> category. In fact, both of the cases where I use deduplication
>>>>> myself fall into such a category. One case involves multiple
>>>>> directories that are partial copies of a larger tree, which are
>>>>> kept in sync with the larger tree and each other. In that
>>>>> particular case I care about whole-file deduplication, so I have
>>>>> a script that just matches on path relative to the roots of each
>>>>> copy and the master copy, verifies that the mtime and size are
>>>>> the same, and if so calls the ioctl for deduplication (with some
>>>>> fancy processing to fit within the max size supported by the
>>>>> ioctl and prevent tiny tail extents). The other case is a set of
>>>>> archives with a pretty solid fixed structure to them. In that
>>>>> case, I have a different script that knows enough about the file
>>>>> structure to know where to look for duplicate blocks, thus
>>>>> avoiding having to hash the whole files.
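To illustrate the kind of script Austin describes (this is not his
actual script): a rough, untested sketch that takes a master file and
a copy, verifies that size and mtime match, and then dedupes the whole
file in capped chunks. The 16 MiB per-request cap is my own assumption
about "the max size supported by the ioctl":

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>

#define CHUNK (16ULL * 1024 * 1024)     /* assumed per-request limit */

/* Dedupe one range of dst against src; returns 0 on success. */
static int dedupe_range(int src, int dst, uint64_t off, uint64_t len)
{
        struct file_dedupe_range *arg =
                calloc(1, sizeof(*arg) +
                          sizeof(struct file_dedupe_range_info));
        arg->src_offset = off;
        arg->src_length = len;
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst;
        arg->info[0].dest_offset = off;
        int ret = ioctl(src, FIDEDUPERANGE, arg);
        if (ret == 0 && arg->info[0].status != FILE_DEDUPE_RANGE_SAME)
                ret = -1;       /* data differed or kernel-side error */
        free(arg);
        return ret;
}

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <master> <copy>\n", argv[0]);
                return 1;
        }
        struct stat a, b;
        if (stat(argv[1], &a) || stat(argv[2], &b)) {
                perror("stat");
                return 1;
        }
        /* Only bother with the ioctl if mtime and size already match. */
        if (a.st_size != b.st_size || a.st_mtime != b.st_mtime) {
                printf("size or mtime differs, skipping\n");
                return 0;
        }
        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_RDWR);
        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }
        for (uint64_t off = 0; off < (uint64_t)a.st_size; off += CHUNK) {
                uint64_t len = (uint64_t)a.st_size - off;
                if (len > CHUNK)
                        len = CHUNK;
                if (dedupe_range(src, dst, off, len))
                        fprintf(stderr, "dedupe failed at %llu\n",
                                (unsigned long long)off);
        }
        close(src);
        close(dst);
        return 0;
}

A real version would walk the whole tree of copies and also handle the
tiny-tail case Austin mentions, for example by balancing the last two
chunks instead of leaving a very short final one; that is left out
here.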
>>>>>
>>>>> The append-only log/database case fits this type of thing
>>>>> perfectly: for each subsequent file, you already know that (most
>>>>> of) the file up to the length of the previous file is duplicated,
>>>>> so you can just split that however you want into chunks, pass
>>>>> those to the dedupe ioctl, and free up most of the space that
>>>>> would be used by the new file. You can then run duperemove with a
>>>>> hashfile to process any new blocks beyond the point you
>>>>> deduplicated up to, to reclaim any excess space (currently this
>>>>> will process the whole file, but it should see that most of it is
>>>>> deduplicated already).
>>>>>>
>>>>>> For backup and archive scenarios the checksum feature and the
>>>>>> dup data/metadata feature of btrfs are really nice, in
>>>>>> particular if one considers the legally prescribed 7-year
>>>>>> retention period.
>>>>>>
>>>>>> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
>>>>>>> On 2016-12-30 15:28, Peter Becker wrote:
>>>>>>>>
>>>>>>>> Hello, I have an 8 TB volume with multiple files of hundreds
>>>>>>>> of GB each. I am trying to dedupe this because the first
>>>>>>>> hundred GB of many files are identical.
>>>>>>>> With a 128 KB block size and the nofiemap and
>>>>>>>> lookup-extents=no options, it will take more than a week
>>>>>>>> (dedupe only, already hashed). So I tried -b 100M, but this
>>>>>>>> returned an error: "Blocksize is bounded ...".
>>>>>>>>
>>>>>>>> The reason is that the block size is limited to
>>>>>>>>
>>>>>>>> #define MAX_BLOCKSIZE (1024U*1024)
>>>>>>>>
>>>>>>>> but I can't find any description of why.
>>>>>>>
>>>>>>> Beyond what Xin mentioned (namely that 1 MB is a much larger
>>>>>>> block than will be duplicated in most data sets), there are a
>>>>>>> couple of other reasons:
>>>>>>> 1. Smaller blocks will actually get you better deduplication on
>>>>>>> average because they're more likely to match. As an example,
>>>>>>> assume you have 2 files with the same 8 4k blocks in different
>>>>>>> orders:
>>>>>>> FileA: 1 2 3 4 5 6 7 8
>>>>>>> FileB: 7 8 5 6 3 4 1 2
>>>>>>> In such a case, deduplicating at any block size above 8k would
>>>>>>> result in zero deduplication between these files, while 8k or
>>>>>>> less would completely deduplicate them. This is of course a
>>>>>>> highly specific and somewhat contrived example (in most cases
>>>>>>> it will be scattered duplicate blocks over dozens of files),
>>>>>>> but it does convey this specific point.
>>>>>>> 2. The kernel will do a byte-wise comparison of all the ranges
>>>>>>> you pass into the ioctl at the same time. Larger block sizes
>>>>>>> here mean that:
>>>>>>>    a) The extents will be locked longer, which will prevent any
>>>>>>>    I/O to the files being deduplicated for the duration of the
>>>>>>>    comparison, which may in turn cause other issues on the
>>>>>>>    system.
>>>>>>>    b) The deduplication process will be stuck in
>>>>>>>    uninterruptible sleep longer, which on many systems will
>>>>>>>    trigger hung-task detection, which will in turn either spam
>>>>>>>    the system log or panic the system depending on how it's
>>>>>>>    configured.
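For the append-only snapshot case discussed above, a rough, untested
sketch of the same idea: dedupe only up to the length of the previous
snapshot's file, in large chunks, and stop as soon as the kernel
reports that the data differs. The chunk size and the early exit are
my own assumptions, not an existing tool:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>

#define CHUNK (16ULL * 1024 * 1024)     /* assumed per-request limit */

/* usage: prefix-dedupe <previous-snapshot-file> <new-snapshot-file>
 * Reflinks the common prefix of the new file against the old one. */
int main(int argc, char **argv)
{
        if (argc != 3)
                return 1;
        int old_fd = open(argv[1], O_RDONLY);
        int new_fd = open(argv[2], O_RDWR);
        struct stat st;
        if (old_fd < 0 || new_fd < 0 || fstat(old_fd, &st))
                return 1;

        /* Only the length of the previous snapshot can be a shared
         * prefix; everything past that is new data. */
        for (uint64_t off = 0; off < (uint64_t)st.st_size; off += CHUNK) {
                uint64_t len = (uint64_t)st.st_size - off;
                if (len > CHUNK)
                        len = CHUNK;

                struct file_dedupe_range *arg =
                        calloc(1, sizeof(*arg) +
                                  sizeof(struct file_dedupe_range_info));
                arg->src_offset = off;
                arg->src_length = len;
                arg->dest_count = 1;
                arg->info[0].dest_fd = new_fd;
                arg->info[0].dest_offset = off;

                int ret = ioctl(old_fd, FIDEDUPERANGE, arg);
                int status = arg->info[0].status;
                free(arg);

                if (ret < 0) {
                        perror("FIDEDUPERANGE");
                        return 1;
                }
                if (status == FILE_DEDUPE_RANGE_DIFFERS) {
                        /* The file diverged before the old length;
                         * stop and leave the rest to duperemove. */
                        printf("data differs at offset %llu, stopping\n",
                               (unsigned long long)off);
                        break;
                }
        }
        close(old_fd);
        close(new_fd);
        return 0;
}

Anything beyond the old length, or past the first divergence, can then
be left to a duperemove run with a hashfile, as Austin suggests.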