Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-04 Thread Stefan Hajnoczi
On Thu, Jan 03, 2013 at 01:51:02PM -0600, Troy Benjegerdes wrote:
 On Thu, Jan 03, 2013 at 01:39:48PM +0100, Stefan Hajnoczi wrote:
  On Wed, Jan 02, 2013 at 12:26:37PM -0600, Troy Benjegerdes wrote:
   The probability may be 'low' but it is not zero. Just because it's
   hard to calculate the hash doesn't mean you can't do it. If your
   input data is not random the probability of a hash collision is
   going to get skewed.
  
  The cost of catching hash collisions is an extra read for every write.
  It's possible to reduce this with a 2nd hash function and/or caching.
  
  I'm not sure it's worth it given the extremely low probability of a hash
  collision.
  
  Venti is an example of an existing system where hash collisions were
  ignored because the probability is so low.  See the "3.1. Choice of Hash
  Function" section:
  
  http://plan9.bell-labs.com/sys/doc/venti/venti.html
 
 
 If you believe that it's 'extremely low', then please provide either:
 
 * experimental evidence to prove your claim
 * an insurance underwriter who will pay out if data is lost due to
 a hash collision.

Read the paper; the point is that if the probability of a collision is so
extremely low, then it's not worth worrying about, since other effects
(e.g. cosmic rays) are much more likely.

The TCP/IP checksums are weak and not comparable to what Benoît is
using.

Stefan



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-03 Thread Stefan Hajnoczi
On Wed, Jan 02, 2013 at 12:26:37PM -0600, Troy Benjegerdes wrote:
 The probability may be 'low' but it is not zero. Just because it's
 hard to calculate the hash doesn't mean you can't do it. If your
 input data is not random the probability of a hash collision is
 going to get skewed.

The cost of catching hash collisions is an extra read for every write.
It's possible to reduce this with a 2nd hash function and/or caching.
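
As an illustration of the second-hash idea (all names here are invented for
the sketch, nothing is from the patchset): if two independent hashes both
have to match, a simultaneous collision in both is far less likely than in
either alone, so the verification read could be skipped or reserved for a
paranoid mode.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dedup table entry: the strong hash plus a short,
 * independent check hash that gates or replaces the verification read. */
typedef struct DedupEntry {
    uint8_t  strong_hash[32];   /* e.g. SHA-256 of the cluster */
    uint32_t check_hash;        /* cheap second hash, e.g. CRC32 */
} DedupEntry;

/* Accept the dedup only when both independent hashes agree. */
static bool dedup_hashes_match(const DedupEntry *e,
                               const uint8_t strong[32], uint32_t check)
{
    return memcmp(e->strong_hash, strong, 32) == 0 && e->check_hash == check;
}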

I'm not sure it's worth it given the extremely low probability of a hash
collision.

Venti is an example of an existing system where hash collisions were
ignored because the probability is so low.  See the "3.1. Choice of Hash
Function" section:

http://plan9.bell-labs.com/sys/doc/venti/venti.html

Stefan



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-03 Thread Benoît Canet

Hello,

I started to write the deduplication metrics code in order to be able
to design asynchronous deduplication.

I am looking for a way to create a metric allowing deduplication to be paused
or resumed at a given threshold.

Does anyone have a suggestion regarding which metric could be used for this?
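
To make the question concrete, here is one possible metric, purely as an
illustration (the names and threshold values are invented): an exponentially
weighted dedup hit ratio with hysteresis, so deduplication pauses when few
recent writes are duplicates and resumes once the ratio recovers.

#include <stdbool.h>

typedef struct DedupThrottle {
    double hit_ratio;   /* EWMA of "this write matched an existing hash" */
    bool   paused;
} DedupThrottle;

#define DEDUP_EWMA_ALPHA   0.01  /* weight given to the newest sample */
#define DEDUP_PAUSE_BELOW  0.05  /* pause when <5% of writes deduplicate */
#define DEDUP_RESUME_ABOVE 0.10  /* resume only after the ratio recovers */

static void dedup_throttle_account(DedupThrottle *t, bool was_dedup)
{
    t->hit_ratio += DEDUP_EWMA_ALPHA * ((was_dedup ? 1.0 : 0.0) - t->hit_ratio);

    if (t->paused && t->hit_ratio > DEDUP_RESUME_ABOVE) {
        t->paused = false;       /* enough duplicates again: resume */
    } else if (!t->paused && t->hit_ratio < DEDUP_PAUSE_BELOW) {
        t->paused = true;        /* dedup is not paying off: pause */
    }
}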

Best regards

Benoît

 On Wednesday 02 Jan 2013 at 17:16:03 (+0100), Benoît Canet wrote:
 This patchset is a cleanup of the previous QCOW2 deduplication rfc.
 
 One can compile and install https://github.com/wernerd/Skein3Fish and use the
 --enable-skein-dedup configure option in order to use the faster Skein hash.
 
 Images must be created with -o dedup=[skein|sha256] in order to activate the
 deduplication in the image.
 
 Deduplication is now fast enough to be usable.
 
 v4: Fix and complete qcow2 spec [Stefan]
 Hash the hash_algo field in the header extension [Stefan]
 Fix qcow2 spec [Eric]
 Remove pointer to hash and simplify hash memory management [Stefan]
 Rename and move qcow2_read_cluster_data to qcow2.c [Stefan]
 Document lock dropping behaviour of the previous function [Stefan]
 cleanup qcow2_dedup_read_missing_cluster_data [Stefan]
 rename *_offset to *_sect [Stefan]
 add a ./configure check for ssl [Stefan]
 Replace openssl by gnutls [Stefan]
 Implement Skein hashes
 Rewrite pretty much every qcow2-dedup.c commit after Add
qcow2_dedup_read_missing_and_concatenate to simplify the code
 Use 64KB deduplication hash block to reduce allocation flushes
 Use 64KB l2 tables to reduce allocation flushes [breaks compatibility]
 Use lazy refcounts to avoid qcow2_cache_set_dependency loops resulting
in frequent cache flushes
 Do not create and load dedup RAM structures when bdrs->read_only is true
 
 v3: make it work barely
 replace kernel red black trees by gtree.
 
 Benoît Canet (30):
   qcow2: Add deduplication to the qcow2 specification.
   qcow2: Add deduplication structures and fields.
   qcow2: Add qcow2_dedup_read_missing_and_concatenate
   qcow2: Make update_refcount public.
   qcow2: Create a way to link to l2 tables when deduplicating.
   qcow2: Add qcow2_dedup and related functions
   qcow2: Add qcow2_dedup_store_new_hashes.
   qcow2: Implement qcow2_compute_cluster_hash.
   qcow2: Extract qcow2_dedup_grow_table
   qcow2: Add qcow2_dedup_grow_table and use it.
   qcow2: create function to load deduplication hashes at startup.
   qcow2: Load and save deduplication table header extension.
   qcow2: Extract qcow2_do_table_init.
   qcow2-cache: Allow to choose table size at creation.
   qcow2: Add qcow2_dedup_init and qcow2_dedup_close.
   qcow2: Extract qcow2_add_feature and qcow2_remove_feature.
   block: Add qemu-img dedup create option.
   qcow2: Behave correctly when refcount reach 0 or 2^16.
   qcow2: Integrate deduplication in qcow2_co_writev loop.
   qcow2: Serialize write requests when deduplication is activated.
   qcow2: Add verification of dedup table.
   qcow2: Adapt checking of QCOW_OFLAG_COPIED for dedup.
   qcow2: Add check_dedup_l2 in order to check l2 of dedup table.
   qcow2: Do not overwrite existing entries with QCOW_OFLAG_COPIED.
   qcow2: Integrate SKEIN hash algorithm in deduplication.
   qcow2: Add lazy refcounts to deduplication to prevent
 qcow2_cache_set_dependency loops
   qcow2: Use large L2 table for deduplication.
   qcow: Set dedup cluster block size to 64KB.
   qcow2: init and cleanup deduplication.
   qemu-iotests: Filter dedup=on/off so existing tests don't break.
 
  block/Makefile.objs  |1 +
  block/qcow2-cache.c  |   12 +-
  block/qcow2-cluster.c|  116 +++--
  block/qcow2-dedup.c  | 1157 ++
  block/qcow2-refcount.c   |  157 --
  block/qcow2.c|  357 +++--
  block/qcow2.h|  120 -
  configure|   55 ++
  docs/specs/qcow2.txt |  100 +++-
  include/block/block_int.h|1 +
  tests/qemu-iotests/common.rc |3 +-
  11 files changed, 1955 insertions(+), 124 deletions(-)
  create mode 100644 block/qcow2-dedup.c
 
 -- 
 1.7.10.4
 



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-03 Thread Troy Benjegerdes
On Thu, Jan 03, 2013 at 01:39:48PM +0100, Stefan Hajnoczi wrote:
 On Wed, Jan 02, 2013 at 12:26:37PM -0600, Troy Benjegerdes wrote:
  The probability may be 'low' but it is not zero. Just because it's
  hard to calculate the hash doesn't mean you can't do it. If your
  input data is not random the probability of a hash collision is
  going to get skewed.
 
 The cost of catching hash collisions is an extra read for every write.
 It's possible to reduce this with a 2nd hash function and/or caching.
 
 I'm not sure it's worth it given the extremely low probability of a hash
 collision.
 
 Venti is an example of an existing system where hash collisions were
 ignored because the probability is so low.  See the "3.1. Choice of Hash
 Function" section:
 
 http://plan9.bell-labs.com/sys/doc/venti/venti.html


If you believe that it's 'extremely low', then please provide either:

* experimental evidence to prove your claim
* an insurance underwriter who will pay out if data is lost due to
a hash collision.

What I have heard so far is a lot of theoretical posturing and no
experimental evidence.

Please google for "when TCP checksums and CRC disagree" to find experimental
evidence of the problems with assuming that the probability is low. This is
the abstract:

Traces of Internet packets from the past two years show that between 1 packet
in 1,100 and 1 packet in 32,000 fails the TCP checksum, even on links where
link-level CRCs should catch all but 1 in 4 billion errors. For certain
situations, the rate of checksum failures can be even higher: in one hour-long
test we observed a checksum failure of 1 packet in 400. We investigate why so
many errors are observed, when link-level CRCs should catch nearly all of
them. We have collected nearly 500,000 packets which failed the TCP or UDP or
IP checksum. This dataset shows the Internet has a wide variety of error
sources which can not be detected by link-level checks. We describe analysis
tools that have identified nearly 100 different error patterns. Categorizing
packet errors, we can infer likely causes which explain roughly half the
observed errors. The causes span the entire spectrum of a network stack, from
memory errors to bugs in TCP. After an analysis we conclude that the checksum
will fail to detect errors for roughly 1 in 16 million to 10 billion packets.
From our analysis of the cause of errors, we propose simple changes to several
protocols which will decrease the rate of undetected error. Even so, the
highly non-random distribution of errors strongly suggests some applications
should employ application-level checksums or equivalents.



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-03 Thread Dietmar Maurer
  Venti is an example of an existing system where hash collisions were
  ignored because the probability is so low.  See the "3.1. Choice of Hash
  Function" section:
 
  http://plan9.bell-labs.com/sys/doc/venti/venti.html
 
 
 If you believe that it's 'extremely low', then please provide either:
 
 * experimental evidence to prove your claim
 * an insurance underwriter who will pay out if data is lost due to a hash
 collision.
 
 What I have heard so far is a lot of theoretical posturing and no experimental
 evidence.

Venti is a well-known system, in use for more than 10 years - isn't that enough 
experimental evidence?

- Dietmar




[Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Benoît Canet
This patchset is a cleanup of the previous QCOW2 deduplication rfc.

One can compile and install https://github.com/wernerd/Skein3Fish and use the
--enable-skein-dedup configure option in order to use the faster Skein hash.

Images must be created with -o dedup=[skein|sha256] in order to activate the
deduplication in the image.

Deduplication is now fast enough to be usable.

v4: Fix and complete qcow2 spec [Stefan]
Hash the hash_algo field in the header extension [Stefan]
Fix qcow2 spec [Eric]
Remove pointer to hash and simplify hash memory management [Stefan]
Rename and move qcow2_read_cluster_data to qcow2.c [Stefan]
Document lock dropping behaviour of the previous function [Stefan]
cleanup qcow2_dedup_read_missing_cluster_data [Stefan]
rename *_offset to *_sect [Stefan]
add a ./configure check for ssl [Stefan]
Replace openssl by gnutls [Stefan]
Implement Skein hashes
Rewrite pretty much every qcow2-dedup.c commit after Add
   qcow2_dedup_read_missing_and_concatenate to simplify the code
Use 64KB deduplication hash block to reduce allocation flushes
Use 64KB l2 tables to reduce allocation flushes [breaks compatibility]
Use lazy refcounts to avoid qcow2_cache_set_dependency loops resulting
   in frequent cache flushes
Do not create and load dedup RAM structures when bdrs->read_only is true

v3: make it work barely
replace kernel red black trees by gtree.

Benoît Canet (30):
  qcow2: Add deduplication to the qcow2 specification.
  qcow2: Add deduplication structures and fields.
  qcow2: Add qcow2_dedup_read_missing_and_concatenate
  qcow2: Make update_refcount public.
  qcow2: Create a way to link to l2 tables when deduplicating.
  qcow2: Add qcow2_dedup and related functions
  qcow2: Add qcow2_dedup_store_new_hashes.
  qcow2: Implement qcow2_compute_cluster_hash.
  qcow2: Extract qcow2_dedup_grow_table
  qcow2: Add qcow2_dedup_grow_table and use it.
  qcow2: create function to load deduplication hashes at startup.
  qcow2: Load and save deduplication table header extension.
  qcow2: Extract qcow2_do_table_init.
  qcow2-cache: Allow to choose table size at creation.
  qcow2: Add qcow2_dedup_init and qcow2_dedup_close.
  qcow2: Extract qcow2_add_feature and qcow2_remove_feature.
  block: Add qemu-img dedup create option.
  qcow2: Behave correctly when refcount reach 0 or 2^16.
  qcow2: Integrate deduplication in qcow2_co_writev loop.
  qcow2: Serialize write requests when deduplication is activated.
  qcow2: Add verification of dedup table.
  qcow2: Adapt checking of QCOW_OFLAG_COPIED for dedup.
  qcow2: Add check_dedup_l2 in order to check l2 of dedup table.
  qcow2: Do not overwrite existing entries with QCOW_OFLAG_COPIED.
  qcow2: Integrate SKEIN hash algorithm in deduplication.
  qcow2: Add lazy refcounts to deduplication to prevent
qcow2_cache_set_dependency loops
  qcow2: Use large L2 table for deduplication.
  qcow: Set dedup cluster block size to 64KB.
  qcow2: init and cleanup deduplication.
  qemu-iotests: Filter dedup=on/off so existing tests don't break.

 block/Makefile.objs  |1 +
 block/qcow2-cache.c  |   12 +-
 block/qcow2-cluster.c|  116 +++--
 block/qcow2-dedup.c  | 1157 ++
 block/qcow2-refcount.c   |  157 --
 block/qcow2.c|  357 +++--
 block/qcow2.h|  120 -
 configure|   55 ++
 docs/specs/qcow2.txt |  100 +++-
 include/block/block_int.h|1 +
 tests/qemu-iotests/common.rc |3 +-
 11 files changed, 1955 insertions(+), 124 deletions(-)
 create mode 100644 block/qcow2-dedup.c

-- 
1.7.10.4




Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Benoît Canet
 How does this code handle hash collisions, and do you have some regression
 tests that purposefully create a dedup hash collision, and verify that the
 'right thing' happens?

The two hash functions that can be used are cryptographic and not yet broken,
so nobody knows how to generate a collision.

You can do the math to calculate the probability of a collision when using a
256-bit hash to process 1 EiB of data: the result is so low that you can
consider it won't happen.
ZFS's sha256 deduplication works the same way regarding collisions.
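
For concreteness, here is a back-of-the-envelope version of that calculation,
assuming the 64 KiB deduplication clusters this series uses (so 1 EiB is
n = 2^60 / 2^16 = 2^44 clusters) and the usual birthday bound:

\[
P(\mathrm{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{256}}
\approx \frac{(2^{44})^2}{2^{257}} = 2^{-169} \approx 10^{-51}
\]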

I currently use qemu-iotests for testing purposes, and iozone with the -w flag
in the guest.
I would like to find a good deduplication stress test to run in a guest.

Regards

Benoît

 It's great that this almost works, but it seems rather dangerous to put
 something like this into the mainline code without some regression tests.
 
 (I'm also suspecting the regression test will be a great way to find 
 flakey hardware)
 



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Troy Benjegerdes
On Wed, Jan 02, 2013 at 05:16:03PM +0100, Benoît Canet wrote:
 This patchset is a cleanup of the previous QCOW2 deduplication rfc.
 
 One can compile and install https://github.com/wernerd/Skein3Fish and use the
 --enable-skein-dedup configure option in order to use the faster skein HASH.
 
 Images must be created with -o dedup=[skein|sha256] in order to activate the
 deduplication in the image.
 
 Deduplication is now fast enough to be usable.

How does this code handle hash collisions, and do you have some regression
tests that purposefully create a dedup hash collision, and verify that the
'right thing' happens?

The next question is .. what's the right thing?

It's great that this almost works, but it seems rather dangerous to put
something like this into the mainline code without some regression tests.

(I'm also suspecting the regression test will be a great way to find 
flakey hardware)

--
Troy Benjegerdes 'da hozer' ho...@hozed.org

Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
software & hardware (http://q3u.be) stuff and not get a real job.
Charles Shultz had the best answer:

Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life. -- Charles Shultz



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Eric Blake
On 01/02/2013 10:33 AM, Benoît Canet wrote:
 How does this code handle hash collisions, and do you have some regression
 tests that purposefully create a dedup hash collision, and verify that the
 'right thing' happens?
 
 The two hash functions that can be used are cryptographic and not yet broken,
 so nobody knows how to generate a collision.

I can understand that it is hard to write a test for two distinct data
sectors hashing to the same value, but perhaps it's worth including a
debug-only hash algorithm that intentionally generates collisions, just
to prove that you handle them correctly.  De-duplicating collided data,
while unlikely, is still a case of data loss that not everyone is happy
to risk.
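
A trivial example of such a debug-only algorithm (invented here for
illustration, not part of the patchset): a "hash" that only looks at the
first byte of the cluster, so colliding inputs are easy to construct on
demand in a test.

#include <stdint.h>
#include <string.h>

/* Debug-only hash that collides on purpose: only the first byte of the
 * cluster contributes, so any two clusters sharing a first byte collide. */
static void debug_colliding_hash(const uint8_t *cluster, size_t len,
                                 uint8_t hash_out[32])
{
    memset(hash_out, 0, 32);
    if (len > 0) {
        hash_out[0] = cluster[0];    /* only 256 possible digests */
    }
}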

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Benoît Canet

I think I can easily add a verify option at image creation.
This way the code would read the cluster already on disk and compare it with
the cluster to be written.
If they differ, it would print some debug message and return -EIO to the
upper layers.
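
A minimal sketch of that verify step, assuming 64KB clusters and a
read_cluster callback standing in for the real block-layer read (the names
here are placeholders, not the patchset's actual functions):

#include <errno.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_SIZE 65536    /* matches the 64KB dedup cluster size */

/* Before deduplicating against an existing cluster, read it back and
 * compare it with the data about to be written; refuse on mismatch. */
static int dedup_verify_match(uint64_t existing_offset,
                              const uint8_t *incoming,
                              int (*read_cluster)(uint64_t off, uint8_t *buf))
{
    static uint8_t on_disk[CLUSTER_SIZE];  /* static: too big for the stack */
    int ret = read_cluster(existing_offset, on_disk);

    if (ret < 0) {
        return ret;                        /* propagate the read error */
    }
    if (memcmp(on_disk, incoming, CLUSTER_SIZE) != 0) {
        /* Hash collision: refuse to deduplicate and surface an error. */
        return -EIO;
    }
    return 0;     /* contents really are identical; safe to deduplicate */
}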

 On Wednesday 02 Jan 2013 at 11:01:04 (-0700), Eric Blake wrote:
 On 01/02/2013 10:33 AM, Benoît Canet wrote:
  How does this code handle hash collisions, and do you have some regression
  tests that purposefully create a dedup hash collision, and verify that the
  'right thing' happens?
  
  The two hash functions that can be used are cryptographic and not yet
  broken, so nobody knows how to generate a collision.
 
 I can understand that it is hard to write a test for two distinct data
 sectors hashing to the same value, but perhaps it's worth including a
 debug-only hash algorithm that intentionally generates collisions, just
 to prove that you handle them correctly.  De-duplicating collided data,
 while unlikely, is still a case of data loss that not everyone is happy
 to risk.
 
 





Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Troy Benjegerdes
The probability may be 'low' but it is not zero. Just because it's
hard to calculate the hash doesn't mean you can't do it. If your
input data is not random the probability of a hash collision is
going to get skewed.

Read about how Bitcoin uses hashes.

I need a budget of around $10,000 or so for some FPGAs and/or GPU cards,
and I can make a regression test that will create deduplication hash
collisions on purpose.


On Wed, Jan 02, 2013 at 06:33:24PM +0100, Benoît Canet wrote:
  How does this code handle hash collisions, and do you have some regression
  tests that purposefully create a dedup hash collision, and verify that the
  'right thing' happens?
 
 The two hash functions that can be used are cryptographic and not yet broken,
 so nobody knows how to generate a collision.
 
 You can do the math to calculate the probability of a collision when using a
 256-bit hash to process 1 EiB of data: the result is so low that you can
 consider it won't happen.
 ZFS's sha256 deduplication works the same way regarding collisions.
 
 I currently use qemu-iotests for testing purposes, and iozone with the -w
 flag in the guest.
 I would like to find a good deduplication stress test to run in a guest.
 
 Regards
 
 Benoît
 
  It's great that this almost works, but it seems rather dangerous to put
  something like this into the mainline code without some regression tests.
  
  (I'm also suspecting the regression test will be a great way to find 
  flakey hardware)
  

--
Troy Benjegerdes 'da hozer' ho...@hozed.org

Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
software & hardware (http://q3u.be) stuff and not get a real job.
Charles Shultz had the best answer:

Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life. -- Charles Shultz



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Benoît Canet
On Wednesday 02 Jan 2013 at 12:26:37 (-0600), Troy Benjegerdes wrote:
 The probability may be 'low' but it is not zero. Just because it's
 hard to calculate the hash doesn't mean you can't do it. If your
 input data is not random the probability of a hash collision is
 going to get skewed.
 
 Read about how Bitcoin uses hashes.
 
 I need a budget of around $10,000 or so for some FPGAs and/or GPU cards,
 and I can make a regression test that will create deduplication hash
 collisions on purpose.

It's not a problem: as Eric pointed out while reviewing the previous patchset,
there is a small space left filled with zeroes in the deduplication block.
A bit could be set in it when a collision is detected, and an offset could
point to a cluster used to resolve collisions.
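
A hypothetical on-disk encoding of that idea (field names and widths are
invented for illustration; the RFC does not specify this layout):

#include <stdint.h>

/* The zeroed padding of a dedup block entry carries a collision flag and
 * the offset of a cluster used to resolve collisions. */
typedef struct {
    uint8_t  hash[32];          /* 256-bit cluster hash */
    uint64_t physical_offset;   /* cluster this hash refers to */
    uint64_t collision_word;    /* bit 63: collision seen; bits 0-62:
                                 * offset of the resolution cluster,
                                 * 0 if none has been allocated yet */
} __attribute__((packed)) DedupBlockEntry;

#define DEDUP_COLLISION_FLAG    (1ULL << 63)
#define DEDUP_RESOLVE_OFFSET(w) ((w) & ~DEDUP_COLLISION_FLAG)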

 
 
  On Wed, Jan 02, 2013 at 06:33:24PM +0100, Benoît Canet wrote:
   How does this code handle hash collisions, and do you have some regression
   tests that purposefully create a dedup hash collision, and verify that the
   'right thing' happens?
  
  The two hash functions that can be used are cryptographic and not yet
  broken, so nobody knows how to generate a collision.
  
  You can do the math to calculate the probability of a collision when using
  a 256-bit hash to process 1 EiB of data: the result is so low that you can
  consider it won't happen.
  ZFS's sha256 deduplication works the same way regarding collisions.
  
  I currently use qemu-iotests for testing purposes, and iozone with the -w
  flag in the guest.
  I would like to find a good deduplication stress test to run in a guest.
  
  Regards
  
  Benoît
  
   It's great that this almost works, but it seems rather dangerous to put
   something like this into the mainline code without some regression tests.
   
   (I'm also suspecting the regression test will be a great way to find 
   flakey hardware)
   
 



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread ronnie sahlberg
Do you really need to resolve the conflicts?
It might be easier and sufficient to just flag those hashes where a
conflict has been detected as: don't dedup this hash anymore,
collisions have been seen.
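
A sketch of that blacklist approach (the names are invented; 'on_disk' is
assumed to be the result of the verification read of the existing cluster):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct HashEntry {
    uint8_t  hash[32];
    uint64_t cluster_offset;
    bool     dedup_disabled;     /* set once a collision has been seen */
} HashEntry;

static bool try_dedup(HashEntry *e, const uint8_t *on_disk,
                      const uint8_t *data, size_t len)
{
    if (e->dedup_disabled) {
        return false;                 /* hash is blacklisted: write anew */
    }
    if (memcmp(on_disk, data, len) != 0) {
        e->dedup_disabled = true;     /* collision seen: never dedup again */
        return false;
    }
    return true;                      /* identical contents: share cluster */
}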


On Wed, Jan 2, 2013 at 10:40 AM, Benoît Canet benoit.ca...@irqsave.net wrote:
 On Wednesday 02 Jan 2013 at 12:26:37 (-0600), Troy Benjegerdes wrote:
 The probability may be 'low' but it is not zero. Just because it's
 hard to calculate the hash doesn't mean you can't do it. If your
 input data is not random the probability of a hash collision is
 going to get skewed.

 Read about how Bitcoin uses hashes.

 I need a budget of around $10,000 or so for some FPGAs and/or GPU cards,
 and I can make a regression test that will create deduplication hash
 collisions on purpose.

 It's not a problem: as Eric pointed out while reviewing the previous
 patchset, there is a small space left filled with zeroes in the
 deduplication block.
 A bit could be set in it when a collision is detected, and an offset could
 point to a cluster used to resolve collisions.



  On Wed, Jan 02, 2013 at 06:33:24PM +0100, Benoît Canet wrote:
   How does this code handle hash collisions, and do you have some regression
   tests that purposefully create a dedup hash collision, and verify that the
   'right thing' happens?
 
  The two hash functions that can be used are cryptographic and not yet
  broken, so nobody knows how to generate a collision.
 
  You can do the math to calculate the probability of a collision when using
  a 256-bit hash to process 1 EiB of data: the result is so low that you can
  consider it won't happen.
  ZFS's sha256 deduplication works the same way regarding collisions.
 
  I currently use qemu-iotests for testing purposes, and iozone with the -w
  flag in the guest.
  I would like to find a good deduplication stress test to run in a guest.
 
  Regards
 
  Benoît
 
   It's great that this almost works, but it seems rather dangerous to put
   something like this into the mainline code without some regression tests.
  
   (I'm also suspecting the regression test will be a great way to find
   flakey hardware)
  






Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Benoît Canet
On Wednesday 02 Jan 2013 at 10:47:48 (-0800), ronnie sahlberg wrote:
 Do you really need to resolve the conflicts?
 It might be easier and sufficient to just flag those hashes where a
 conflict has been detected as: don't dedup this hash anymore,
 collisions have been seen.

True, that's more elegant.
The user would still need to specify the verify option at creation,
and it would require a read before verifying, but it would not make
the qcow2 format uglier.

 
 
  On Wed, Jan 2, 2013 at 10:40 AM, Benoît Canet benoit.ca...@irqsave.net wrote:
  On Wednesday 02 Jan 2013 at 12:26:37 (-0600), Troy Benjegerdes wrote:
  The probability may be 'low' but it is not zero. Just because it's
  hard to calculate the hash doesn't mean you can't do it. If your
  input data is not random the probability of a hash collision is
  going to get skewed.
 
  Read about how Bitcoin uses hashes.
 
  I need a budget of around $10,000 or so for some FPGAs and/or GPU cards,
  and I can make a regression test that will create deduplication hash
  collisions on purpose.
 
  It's not a problem: as Eric pointed out while reviewing the previous
  patchset, there is a small space left filled with zeroes in the
  deduplication block.
  A bit could be set in it when a collision is detected, and an offset could
  point to a cluster used to resolve collisions.
 
 
 
   On Wed, Jan 02, 2013 at 06:33:24PM +0100, Benoît Canet wrote:
How does this code handle hash collisions, and do you have some regression
tests that purposefully create a dedup hash collision, and verify that the
'right thing' happens?
  
   The two hash functions that can be used are cryptographic and not yet
   broken, so nobody knows how to generate a collision.
  
   You can do the math to calculate the probability of a collision when using
   a 256-bit hash to process 1 EiB of data: the result is so low that you can
   consider it won't happen.
   ZFS's sha256 deduplication works the same way regarding collisions.
  
   I currently use qemu-iotests for testing purposes, and iozone with the -w
   flag in the guest.
   I would like to find a good deduplication stress test to run in a guest.
  
   Regards
  
   Benoît
  
It's great that this almost works, but it seems rather dangerous to put
something like this into the mainline code without some regression tests.
   
(I'm also suspecting the regression test will be a great way to find
flakey hardware)
   
 
 
 



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread Troy Benjegerdes
If you do get a hash collision, it's a rather exceptional event, so I'd 
say every effort should be made to log the event and the data that created
it in multiple places.

There are three questions I'd ask on a hash collision:

1) was it the data?
2) was it the hardware?
3) was it a software bug?

On Wed, Jan 02, 2013 at 10:47:48AM -0800, ronnie sahlberg wrote:
 Do you really need to resolve the conflicts?
 It might be easier and sufficient to just flag those hashes where a
 conflict has been detected as: don't dedup this hash anymore,
 collisions have been seen.
 
 
 On Wed, Jan 2, 2013 at 10:40 AM, Benoît Canet benoit.ca...@irqsave.net wrote:
  On Wednesday 02 Jan 2013 at 12:26:37 (-0600), Troy Benjegerdes wrote:
  The probability may be 'low' but it is not zero. Just because it's
  hard to calculate the hash doesn't mean you can't do it. If your
  input data is not random the probability of a hash collision is
  going to get skewed.
 
  Read about how Bitcoin uses hashes.
 
  I need a budget of around $10,000 or so for some FPGAs and/or GPU cards,
  and I can make a regression test that will create deduplication hash
  collisions on purpose.
 
  It's not a problem: as Eric pointed out while reviewing the previous
  patchset, there is a small space left filled with zeroes in the
  deduplication block.
  A bit could be set in it when a collision is detected, and an offset could
  point to a cluster used to resolve collisions.
 
 
 
   On Wed, Jan 02, 2013 at 06:33:24PM +0100, Benoît Canet wrote:
How does this code handle hash collisions, and do you have some regression
tests that purposefully create a dedup hash collision, and verify that the
'right thing' happens?
  
   The two hash functions that can be used are cryptographic and not yet
   broken, so nobody knows how to generate a collision.
  
   You can do the math to calculate the probability of a collision when using
   a 256-bit hash to process 1 EiB of data: the result is so low that you can
   consider it won't happen.
   ZFS's sha256 deduplication works the same way regarding collisions.
  
   I currently use qemu-iotests for testing purposes, and iozone with the -w
   flag in the guest.
   I would like to find a good deduplication stress test to run in a guest.
  
   Regards
  
   Benoît
  
It's great that this almost works, but it seems rather dangerous to put
something like this into the mainline code without some regression tests.
   
(I'm also suspecting the regression test will be a great way to find
flakey hardware)
   
 
 
 

--
Troy Benjegerdes 'da hozer' ho...@hozed.org

Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
software & hardware (http://q3u.be) stuff and not get a real job.
Charles Shultz had the best answer:

Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life. -- Charles Shultz



Re: [Qemu-devel] [RFC V4 00/30] QCOW2 deduplication

2013-01-02 Thread ronnie sahlberg
On Wed, Jan 2, 2013 at 11:18 AM, Troy Benjegerdes ho...@hozed.org wrote:
 If you do get a hash collision, it's a rather exceptional event, so I'd
 say every effort should be made to log the event and the data that created
 it in multiple places.

 There are three questions I'd ask on a hash collision:

 1) was it the data?
 2) was it the hardware?
 3) was it a software bug?

Yes, that is probably good too, along with saving off the old and new block
contents that collided.

Unless you are checksumming the blocks, I suspect that the most common
reason for collisions would just be cases where the original block was
corrupted/changed on disk without being detected; then, when you re-write an
identical one, the blocks no longer match and you get a false collision.