Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
On Thu, Jan 03, 2013 at 01:51:02PM -0600, Troy Benjegerdes wrote:
> On Thu, Jan 03, 2013 at 01:39:48PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Jan 02, 2013 at 12:26:37PM -0600, Troy Benjegerdes wrote:
> > > The probability may be 'low' but it is not zero. Just because it's hard
> > > to calculate the hash doesn't mean you can't do it. If your input data
> > > is not random, the probability of a hash collision is going to get
> > > skewed.
> >
> > The cost of catching hash collisions is an extra read for every write.
> > It's possible to reduce this with a 2nd hash function and/or caching. I'm
> > not sure it's worth it given the extremely low probability of a hash
> > collision.
> >
> > Venti is an example of an existing system where hash collisions were
> > ignored because the probability is so low. See the "3.1. Choice of Hash
> > Function" section: http://plan9.bell-labs.com/sys/doc/venti/venti.html
>
> If you believe that it's 'extremely low', then please provide either:
> * experimental evidence to prove your claim
> * an insurance underwriter who will pay out if data is lost due to a hash
>   collision.

Read the paper. The point is that if the probability of collision is so
extremely low, then it's not worth worrying about, since other effects
(e.g. cosmic rays) are much more likely. The TCP/IP checksums are weak and
not comparable to what Benoît is using.

Stefan
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
On Wed, Jan 02, 2013 at 12:26:37PM -0600, Troy Benjegerdes wrote:
> The probability may be 'low' but it is not zero. Just because it's hard to
> calculate the hash doesn't mean you can't do it. If your input data is not
> random, the probability of a hash collision is going to get skewed.

The cost of catching hash collisions is an extra read for every write. It's
possible to reduce this with a 2nd hash function and/or caching. I'm not
sure it's worth it given the extremely low probability of a hash collision.

Venti is an example of an existing system where hash collisions were
ignored because the probability is so low. See the "3.1. Choice of Hash
Function" section: http://plan9.bell-labs.com/sys/doc/venti/venti.html

Stefan
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
Hello,

I started to write the deduplication metrics code in order to be able to
design asynchronous deduplication. I am looking for a way to create a
metric allowing deduplication to be paused or resumed on a given threshold.

Does anyone have a suggestion regarding the metric that could be used for
this?

Best regards

Benoît

On Wednesday 02 Jan 2013 at 17:16:03 (+0100), Benoît Canet wrote:
> This patchset is a cleanup of the previous QCOW2 deduplication RFC.
>
> One can compile and install https://github.com/wernerd/Skein3Fish and use
> the --enable-skein-dedup configure option in order to use the faster
> Skein hash.
>
> Images must be created with -o dedup=[skein|sha256] in order to activate
> deduplication in the image.
>
> Deduplication is now fast enough to be usable.
>
> [full v4 cover letter quoted below - snip]
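One possible answer to the metric question, offered purely as a hypothetical
sketch (none of these names exist in the patchset): track the dedup hit
ratio over a sliding window of recent cluster writes, pause deduplication
when the ratio falls below one threshold, and resume above a higher one so
the state does not flap around a single cut-off.

```python
from collections import deque

class DedupMetric:
    """Hypothetical pause/resume metric: dedup hit ratio over the
    last N cluster writes, with hysteresis between two thresholds."""

    def __init__(self, window=1024, pause_below=0.05, resume_above=0.10):
        self.window = deque(maxlen=window)  # 1 = dedup hit, 0 = miss
        self.pause_below = pause_below
        self.resume_above = resume_above
        self.paused = False

    def record(self, was_dedup_hit: bool) -> None:
        self.window.append(1 if was_dedup_hit else 0)

    @property
    def hit_ratio(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0

    def should_dedup(self) -> bool:
        # Only pause once the window is full, so startup noise is ignored.
        if self.paused and self.hit_ratio >= self.resume_above:
            self.paused = False
        elif (not self.paused
              and len(self.window) == self.window.maxlen
              and self.hit_ratio < self.pause_below):
            self.paused = True
        return not self.paused
```

The write path would call `record()` after each cluster write and consult
`should_dedup()` before hashing the next one; the window size and thresholds
are made-up tuning knobs.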
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
On Thu, Jan 03, 2013 at 01:39:48PM +0100, Stefan Hajnoczi wrote:
> On Wed, Jan 02, 2013 at 12:26:37PM -0600, Troy Benjegerdes wrote:
> > The probability may be 'low' but it is not zero. Just because it's hard
> > to calculate the hash doesn't mean you can't do it. If your input data
> > is not random, the probability of a hash collision is going to get
> > skewed.
>
> The cost of catching hash collisions is an extra read for every write.
> It's possible to reduce this with a 2nd hash function and/or caching. I'm
> not sure it's worth it given the extremely low probability of a hash
> collision.
>
> Venti is an example of an existing system where hash collisions were
> ignored because the probability is so low. See the "3.1. Choice of Hash
> Function" section: http://plan9.bell-labs.com/sys/doc/venti/venti.html

If you believe that it's 'extremely low', then please provide either:

* experimental evidence to prove your claim
* an insurance underwriter who will pay out if data is lost due to a hash
  collision.

What I have heard so far is a lot of theoretical posturing and no
experimental evidence. Please google "when TCP checksums and CRC disagree"
for experimental evidence of the problems with assuming that the
probability is low. This is the abstract:

  Traces of Internet packets from the past two years show that between 1
  packet in 1,100 and 1 packet in 32,000 fails the TCP checksum, even on
  links where link-level CRCs should catch all but 1 in 4 billion errors.
  For certain situations, the rate of checksum failures can be even higher:
  in one hour-long test we observed a checksum failure of 1 packet in 400.
  We investigate why so many errors are observed, when link-level CRCs
  should catch nearly all of them. We have collected nearly 500,000 packets
  which failed the TCP or UDP or IP checksum. This dataset shows the
  Internet has a wide variety of error sources which can not be detected by
  link-level checks. We describe analysis tools that have identified nearly
  100 different error patterns. Categorizing packet errors, we can infer
  likely causes which explain roughly half the observed errors. The causes
  span the entire spectrum of a network stack, from memory errors to bugs
  in TCP. After an analysis we conclude that the checksum will fail to
  detect errors for roughly 1 in 16 million to 10 billion packets. From our
  analysis of the cause of errors, we propose simple changes to several
  protocols which will decrease the rate of undetected error. Even so, the
  highly non-random distribution of errors strongly suggests some
  applications should employ application-level checksums or equivalents.
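The weakness the abstract describes is easy to demonstrate, and it is why
the TCP checksum is not comparable to a cryptographic hash: the Internet
checksum (RFC 1071) is a 16-bit ones'-complement sum of 16-bit words, so any
payload that merely reorders those words collides. A minimal sketch, not
taken from any of the code discussed here:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: ones'-complement sum of 16-bit
    big-endian words, padding odd-length data with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

a = b"\x12\x34\xab\xcd"
b = b"\xab\xcd\x12\x34"  # same 16-bit words, different order
assert a != b
assert internet_checksum(a) == internet_checksum(b)  # checksum collision
```

Two distinct payloads colliding under a 16-bit sum takes one line; nobody
knows how to do the same for SHA-256 or Skein, which is the distinction
Stefan draws elsewhere in the thread.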
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
> > Venti is an example of an existing system where hash collisions were
> > ignored because the probability is so low. See the "3.1. Choice of Hash
> > Function" section: http://plan9.bell-labs.com/sys/doc/venti/venti.html
>
> If you believe that it's 'extremely low', then please provide either:
> * experimental evidence to prove your claim
> * an insurance underwriter who will pay out if data is lost due to a hash
>   collision.
>
> What I have heard so far is a lot of theoretical posturing and no
> experimental evidence.

Venti is a well-known system, in use for more than 10 years - isn't that
enough experimental evidence?

- Dietmar
[Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
This patchset is a cleanup of the previous QCOW2 deduplication RFC.

One can compile and install https://github.com/wernerd/Skein3Fish and use
the --enable-skein-dedup configure option in order to use the faster Skein
hash.

Images must be created with -o dedup=[skein|sha256] in order to activate
deduplication in the image.

Deduplication is now fast enough to be usable.

v4:
  Fix and complete qcow2 spec [Stefan]
  Hash the hash_algo field in the header extension [Stefan]
  Fix qcow2 spec [Eric]
  Remove pointer to hash and simplify hash memory management [Stefan]
  Rename and move qcow2_read_cluster_data to qcow2.c [Stefan]
  Document lock dropping behaviour of the previous function [Stefan]
  Clean up qcow2_dedup_read_missing_cluster_data [Stefan]
  Rename *_offset to *_sect [Stefan]
  Add a ./configure check for ssl [Stefan]
  Replace openssl by gnutls [Stefan]
  Implement Skein hashes
  Rewrite pretty much every qcow2-dedup.c commit after "Add
    qcow2_dedup_read_missing_and_concatenate" to simplify the code
  Use 64KB deduplication hash blocks to reduce allocation flushes
  Use 64KB L2 tables to reduce allocation flushes [breaks compatibility]
  Use lazy refcounts to avoid qcow2_cache_set_dependency loops resulting
    in frequent cache flushes
  Do not create and load dedup RAM structures when bdrs->read_only is true

v3:
  Make it work barely
  Replace kernel red-black trees by GTree

*** BLURB HERE ***

Benoît Canet (30):
  qcow2: Add deduplication to the qcow2 specification.
  qcow2: Add deduplication structures and fields.
  qcow2: Add qcow2_dedup_read_missing_and_concatenate
  qcow2: Make update_refcount public.
  qcow2: Create a way to link to l2 tables when deduplicating.
  qcow2: Add qcow2_dedup and related functions
  qcow2: Add qcow2_dedup_store_new_hashes.
  qcow2: Implement qcow2_compute_cluster_hash.
  qcow2: Extract qcow2_dedup_grow_table
  qcow2: Add qcow2_dedup_grow_table and use it.
  qcow2: create function to load deduplication hashes at startup.
  qcow2: Load and save deduplication table header extension.
  qcow2: Extract qcow2_do_table_init.
  qcow2-cache: Allow to choose table size at creation.
  qcow2: Add qcow2_dedup_init and qcow2_dedup_close.
  qcow2: Extract qcow2_add_feature and qcow2_remove_feature.
  block: Add qemu-img dedup create option.
  qcow2: Behave correctly when refcount reach 0 or 2^16.
  qcow2: Integrate deduplication in qcow2_co_writev loop.
  qcow2: Serialize write requests when deduplication is activated.
  qcow2: Add verification of dedup table.
  qcow2: Adapt checking of QCOW_OFLAG_COPIED for dedup.
  qcow2: Add check_dedup_l2 in order to check l2 of dedup table.
  qcow2: Do not overwrite existing entries with QCOW_OFLAG_COPIED.
  qcow2: Integrate SKEIN hash algorithm in deduplication.
  qcow2: Add lazy refcounts to deduplication to prevent
    qcow2_cache_set_dependency loops
  qcow2: Use large L2 table for deduplication.
  qcow: Set dedup cluster block size to 64KB.
  qcow2: init and cleanup deduplication.
  qemu-iotests: Filter dedup=on/off so existing tests don't break.

 block/Makefile.objs          |    1 +
 block/qcow2-cache.c          |   12 +-
 block/qcow2-cluster.c        |  116 +++--
 block/qcow2-dedup.c          | 1157 ++
 block/qcow2-refcount.c       |  157 --
 block/qcow2.c                |  357 +++--
 block/qcow2.h                |  120 -
 configure                    |   55 ++
 docs/specs/qcow2.txt         |  100 +++-
 include/block/block_int.h    |    1 +
 tests/qemu-iotests/common.rc |    3 +-
 11 files changed, 1955 insertions(+), 124 deletions(-)
 create mode 100644 block/qcow2-dedup.c

--
1.7.10.4
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
> How does this code handle hash collisions, and do you have some
> regression tests that purposefully create a dedup hash collision, and
> verify that the 'right thing' happens?

The two hash functions that can be used are cryptographic and not broken
yet, so nobody knows how to generate a collision.

You can do the math to calculate the probability of collision using a
256-bit hash while processing 1 EiB of data: the result is so low you can
consider it won't happen. The sha256 ZFS deduplication works the same way
regarding collisions.

I currently use qemu-io-test for testing purposes and iozone with the -w
flag in the guest. I would like to find a good deduplication stress test
to run in a guest.

Regards

Benoît

> It's great that this almost works, but it seems rather dangerous to put
> something like this into the mainline code without some regression tests.
>
> (I'm also suspecting the regression test will be a great way to find
> flakey hardware)
>
> --
> Troy Benjegerdes 'da hozer' ho...@hozed.org
>
> Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
> software hardware (http://q3u.be) stuff and not get a real job. Charles
> Shultz had the best answer:
>
> "Why do musicians compose symphonies and poets write poems? They do it
> because life wouldn't have any meaning for them if they didn't. That's
> why I draw cartoons. It's my life." -- Charles Shultz
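The "do the math" claim above can be checked with the standard birthday
bound. A short sketch, assuming the series' 64 KiB dedup granularity (the
bound itself is generic):

```python
from math import log2

HASH_BITS = 256
CLUSTER = 64 * 1024            # 64 KiB dedup block size, as in this series
data = 2 ** 60                 # 1 EiB of unique data
n = data // CLUSTER            # 2**44 clusters hashed

# Birthday bound: P(any collision among n random 256-bit hashes)
#   <= n * (n - 1) / 2**(HASH_BITS + 1)
p = n * (n - 1) / 2 ** (HASH_BITS + 1)
print(f"collision probability upper bound: about 2**{log2(p):.0f}")
```

That works out to roughly 2**-169: many orders of magnitude below the
undetected-error rates quoted elsewhere in this thread for ECC RAM or disk
transfers, which is the basis for the "consider it won't happen" argument
(and the one Venti and ZFS sha256 dedup make as well).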
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
On Wed, Jan 02, 2013 at 05:16:03PM +0100, Benoît Canet wrote:
> This patchset is a cleanup of the previous QCOW2 deduplication RFC.
>
> One can compile and install https://github.com/wernerd/Skein3Fish and use
> the --enable-skein-dedup configure option in order to use the faster
> Skein hash.
>
> Images must be created with -o dedup=[skein|sha256] in order to activate
> deduplication in the image.
>
> Deduplication is now fast enough to be usable.

How does this code handle hash collisions, and do you have some regression
tests that purposefully create a dedup hash collision, and verify that the
'right thing' happens? The next question is: what's the right thing?

It's great that this almost works, but it seems rather dangerous to put
something like this into the mainline code without some regression tests.

(I'm also suspecting the regression test will be a great way to find flakey
hardware)

--
Troy Benjegerdes 'da hozer' ho...@hozed.org

Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
software hardware (http://q3u.be) stuff and not get a real job. Charles
Shultz had the best answer:

"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life." -- Charles Shultz
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
On 01/02/2013 10:33 AM, Benoît Canet wrote:
> > How does this code handle hash collisions, and do you have some
> > regression tests that purposefully create a dedup hash collision, and
> > verify that the 'right thing' happens?
>
> The two hash functions that can be used are cryptographic and not broken
> yet, so nobody knows how to generate a collision.

I can understand that it is hard to write a test for two distinct data
sectors hashing to the same value, but perhaps it's worth including a
debug-only hash algorithm that intentionally generates collisions, just to
prove that you handle them correctly. De-duplicating collided data, while
unlikely, is still a case of data loss that not everyone is happy to risk.

--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
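Eric's debug-only hash idea can be sketched in a few lines. Nothing here is
from the patchset; the trick is simply to truncate a real hash so hard that
birthday collisions appear after a few hundred random clusters, giving a
regression test something to exercise:

```python
import hashlib
import os

def debug_hash(cluster: bytes) -> bytes:
    """Debug-only stand-in for the dedup hash: keep just 2 bytes of
    SHA-256, so 16-bit birthday collisions show up almost immediately."""
    return hashlib.sha256(cluster).digest()[:2]

def find_collision():
    """Generate random 'clusters' until two distinct ones share a
    debug hash; expected after roughly 2**8 = a few hundred tries."""
    seen = {}
    while True:
        cluster = os.urandom(64)
        h = debug_hash(cluster)
        if h in seen and seen[h] != cluster:
            return seen[h], cluster
        seen[h] = cluster

a, b = find_collision()
assert a != b and debug_hash(a) == debug_hash(b)
```

A test built this way checks the collision-handling path itself, without
needing the (unknown) ability to collide SHA-256 or Skein.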
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
I think I can easily add a verify option at image creation. This way the
code would read the cluster already on disk and compare it with the cluster
to write. If they are different it would print some debug message and
return -EIO to the upper layers.

On Wednesday 02 Jan 2013 at 11:01:04 (-0700), Eric Blake wrote:
> On 01/02/2013 10:33 AM, Benoît Canet wrote:
> > > How does this code handle hash collisions, and do you have some
> > > regression tests that purposefully create a dedup hash collision, and
> > > verify that the 'right thing' happens?
> >
> > The two hash functions that can be used are cryptographic and not
> > broken yet, so nobody knows how to generate a collision.
>
> I can understand that it is hard to write a test for two distinct data
> sectors hashing to the same value, but perhaps it's worth including a
> debug-only hash algorithm that intentionally generates collisions, just
> to prove that you handle them correctly. De-duplicating collided data,
> while unlikely, is still a case of data loss that not everyone is happy
> to risk.
>
> --
> Eric Blake eblake redhat com +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
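The verify option described above amounts to one extra read on every dedup
hit. A toy model of the idea, with all names invented for illustration
(`store` stands in for the image's clusters, `hash_index` for the dedup
table):

```python
import hashlib

EIO = 5  # mirror the -EIO errno-style return convention described above

def dedup_write(store: dict, hash_index: dict, cluster: bytes,
                verify: bool = True) -> int:
    """Hypothetical sketch of verify-on-dedup: on a hash hit, re-read
    the stored cluster and byte-compare before deduplicating."""
    h = hashlib.sha256(cluster).digest()
    if h in hash_index:
        on_disk = store[hash_index[h]]     # the extra read per dedup hit
        if verify and on_disk != cluster:  # hash matched but data differs
            return -EIO                    # surface it to the upper layers
        return 0                           # deduplicated: link, no new write
    offset = len(store)                    # toy append-only allocation
    store[offset] = cluster
    hash_index[h] = offset
    return 0
```

Writing the same cluster twice stores it once; a mismatch between the
stored bytes and the incoming write, whether a true hash collision or
on-disk corruption, comes back as -EIO instead of silently deduplicating.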
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
The probability may be 'low' but it is not zero. Just because it's hard to
calculate the hash doesn't mean you can't do it. If your input data is not
random, the probability of a hash collision is going to get skewed. Read
about how Bitcoin uses hashes.

I need a budget of around $10,000 or so for some FPGAs and/or GPU cards,
and I can make a regression test that will create deduplication hash
collisions on purpose.

On Wed, Jan 02, 2013 at 06:33:24PM +0100, Benoît Canet wrote:
> > How does this code handle hash collisions, and do you have some
> > regression tests that purposefully create a dedup hash collision, and
> > verify that the 'right thing' happens?
>
> The two hash functions that can be used are cryptographic and not broken
> yet, so nobody knows how to generate a collision.
>
> You can do the math to calculate the probability of collision using a
> 256-bit hash while processing 1 EiB of data: the result is so low you can
> consider it won't happen. The sha256 ZFS deduplication works the same way
> regarding collisions.
>
> I currently use qemu-io-test for testing purposes and iozone with the -w
> flag in the guest. I would like to find a good deduplication stress test
> to run in a guest.
>
> Regards
>
> Benoît
>
> > It's great that this almost works, but it seems rather dangerous to put
> > something like this into the mainline code without some regression
> > tests.
> >
> > (I'm also suspecting the regression test will be a great way to find
> > flakey hardware)

--
Troy Benjegerdes 'da hozer' ho...@hozed.org
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
On Wednesday 02 Jan 2013 at 12:26:37 (-0600), Troy Benjegerdes wrote:
> The probability may be 'low' but it is not zero. Just because it's hard
> to calculate the hash doesn't mean you can't do it. If your input data is
> not random, the probability of a hash collision is going to get skewed.
> Read about how Bitcoin uses hashes.
>
> I need a budget of around $10,000 or so for some FPGAs and/or GPU cards,
> and I can make a regression test that will create deduplication hash
> collisions on purpose.

It's not a problem: as Eric pointed out while reviewing the previous
patchset, there is a small place left with zeroes on the deduplication
block. A bit could be set on it when a collision is detected, and an offset
could point to a cluster used to resolve collisions.

> [snip]
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
Do you really need to resolve the conflicts?

It might be easier and sufficient to just flag those hashes where a
conflict has been detected as: "don't dedup this hash anymore, collisions
have been seen".

On Wed, Jan 2, 2013 at 10:40 AM, Benoît Canet benoit.ca...@irqsave.net wrote:
> On Wednesday 02 Jan 2013 at 12:26:37 (-0600), Troy Benjegerdes wrote:
> > I need a budget of around $10,000 or so for some FPGAs and/or GPU
> > cards, and I can make a regression test that will create deduplication
> > hash collisions on purpose.
>
> It's not a problem: as Eric pointed out while reviewing the previous
> patchset, there is a small place left with zeroes on the deduplication
> block. A bit could be set on it when a collision is detected, and an
> offset could point to a cluster used to resolve collisions.
>
> [snip]
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
On Wednesday 02 Jan 2013 at 10:47:48 (-0800), ronnie sahlberg wrote:
> Do you really need to resolve the conflicts?
>
> It might be easier and sufficient to just flag those hashes where a
> conflict has been detected as: "don't dedup this hash anymore, collisions
> have been seen".

True, that's more elegant. The user would still need to specify the verify
option at creation, and it would require doing a read before verifying, but
it would not make the qcow2 format uglier.

> [snip]
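The "blacklist the hash on first collision" idea can be modelled in a few
lines. This is a hypothetical sketch, not the qcow2 code: `store` stands in
for the image's clusters, `table` for the dedup table, and the `no_dedup`
flag is the bit being discussed.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class HashEntry:
    offset: int
    no_dedup: bool = False  # set once a collision is seen for this hash

def maybe_dedup(store: dict, table: dict, cluster: bytes) -> int:
    """On a hash hit, a verify read either confirms the dedup or marks
    the entry no_dedup; blacklisted hashes are always written normally."""
    h = hashlib.sha256(cluster).digest()
    entry = table.get(h)
    if entry is not None and not entry.no_dedup:
        if store[entry.offset] == cluster:  # verify read: data identical
            return entry.offset             # dedup hit, link to it
        entry.no_dedup = True               # collision: never dedup h again
    offset = len(store)
    store[offset] = cluster                 # write the cluster out normally
    table.setdefault(h, HashEntry(offset))  # keep a blacklisted entry as-is
    return offset
```

The cost of the scheme is exactly the trade-off noted above: every dedup
hit still pays the verify read, but the format only needs one spare flag
bit per hash entry rather than a collision-resolution chain.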
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
If you do get a hash collision, it's a rather exceptional event, so I'd say
every effort should be made to log the event and the data that created it
in multiple places.

There are three questions I'd ask on a hash collision:

1) was it the data?
2) was it the hardware?
3) was it a software bug?

On Wed, Jan 02, 2013 at 10:47:48AM -0800, ronnie sahlberg wrote:
> Do you really need to resolve the conflicts?
>
> It might be easier and sufficient to just flag those hashes where a
> conflict has been detected as: "don't dedup this hash anymore, collisions
> have been seen".
>
> [snip]

--
Troy Benjegerdes 'da hozer' ho...@hozed.org
Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication
On Wed, Jan 2, 2013 at 11:18 AM, Troy Benjegerdes ho...@hozed.org wrote:
> If you do get a hash collision, it's a rather exceptional event, so I'd
> say every effort should be made to log the event and the data that
> created it in multiple places.
>
> There are three questions I'd ask on a hash collision:
> 1) was it the data?
> 2) was it the hardware?
> 3) was it a software bug?

Yes, that is probably good too, along with saving off the old and new block
content that collided.

Unless you are checksumming the blocks, I suspect that the most common
reason for collisions would just be cases where the original block was
corrupted/changed on disk and you don't detect it; then, when you re-write
an identical one, the blocks no longer match and you get a false collision.