On 06/10/10 02:21 PM, Pawel Jakub Dawidek wrote:
On Thu, Jun 10, 2010 at 12:21:02PM -0600, Lori Alt wrote:
It's not possible to implement unless we establish bidirectional
communication between the sending and receiving sides.  The logic for
send-stream dedup is:

for (each block to be written to stream) {

    get the block's checksum

    lookup the block's checksum in the dedup-table
         established for *this* stream generation

    if (an entry in the DDT exists for this checksum)

        send a "write-by-reference" block across the stream
             (this contains a reference to a block sent earlier in the
             stream)

    else {

        add an entry for this block to the DDT

        send the full block

    }

}
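
The loop above can be sketched in Python as follows. This is only an illustrative model, not the actual ZFS send code: the record types `Write` and `WriteByRef`, their field names, the function `dedup_stream`, and the use of SHA-256 as the block checksum are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple, Union

Location = Tuple[int, int]  # (object number, offset); hypothetical addressing


@dataclass(frozen=True)
class Write:
    """Hypothetical record carrying the full block."""
    obj: int
    offset: int
    data: bytes


@dataclass(frozen=True)
class WriteByRef:
    """Hypothetical record pointing at a block sent earlier in the stream."""
    obj: int
    offset: int
    ref_obj: int
    ref_offset: int


Record = Union[Write, WriteByRef]


def dedup_stream(blocks: Iterable[Tuple[int, int, bytes]]) -> Iterator[Record]:
    # The table lives only for this stream generation, so it knows
    # nothing about blocks sent in earlier streams.
    ddt: Dict[bytes, Location] = {}
    for obj, offset, data in blocks:
        checksum = hashlib.sha256(data).digest()  # assumed checksum function
        ref = ddt.get(checksum)
        if ref is not None:
            # Seen earlier in this stream: send a reference instead of data.
            yield WriteByRef(obj, offset, ref[0], ref[1])
        else:
            # First occurrence: remember where it was sent, send it in full.
            ddt[checksum] = (obj, offset)
            yield Write(obj, offset, data)
```

For example, feeding it `[(1, 0, b"aaaa"), (1, 4, b"bbbb"), (2, 0, b"aaaa")]` yields two full writes followed by a reference from object 2, offset 0 back to object 1, offset 0.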

Since the dedup table on the sending side only knows about blocks
already sent in the stream, we have no way of knowing whether a copy of
the block already exists on the other side, and even if we did know, we
wouldn't know where it was on the other side.  The sending side would
have to have a copy of the other side's on-disk DDT to know whether a
write-by-reference could be used.

If we send an incremental stream, we can be sure that up to the previous
snapshot we have the same data on the other side. I'm aware that doesn't
mean the data has exactly the same checksum (e.g. it can be compressed
with a different algorithm). But in theory, are we able to figure out
that the given block we are trying to send is already part of the
dataset's previous snapshot? I'm fine with discarding the incremental
stream on the remote site if it uses a different compression algorithm
or if deduplication is simply turned off (basically, when there is no
block matching the stored checksum). But if I have identical
configurations on both ends, I'd like not to send the same block
multiple times in multiple incremental streams.

Each incremental stream only contains the blocks that are new or changed since the last snapshot, so I don't see how you can be sure that the data already exists on the receiving side. But even if you did know that the block already exists on the receiving side, you don't know where it is. That is, you don't know what to put in the "reference" field of the send stream record. You don't know the object number and offset of where the block already exists on the receiving side.
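
To make the "reference" field concrete, here is a rough Python model of a receiver applying such a stream. It is a sketch under assumptions, not the real receive code: the record types, their field names, and the `receive` function are hypothetical, and the receiver here resolves references only against blocks written earlier in the same stream.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple, Union


@dataclass(frozen=True)
class Write:
    """Hypothetical record carrying the full block."""
    obj: int
    offset: int
    data: bytes


@dataclass(frozen=True)
class WriteByRef:
    """Hypothetical record naming the location of an earlier block."""
    obj: int
    offset: int
    ref_obj: int
    ref_offset: int


def receive(records: Iterable[Union[Write, WriteByRef]]) -> Dict[Tuple[int, int], bytes]:
    # Maps (object number, offset) to block contents written so far.
    written: Dict[Tuple[int, int], bytes] = {}
    for rec in records:
        if isinstance(rec, Write):
            written[(rec.obj, rec.offset)] = rec.data
        else:
            key = (rec.ref_obj, rec.ref_offset)
            if key not in written:
                # The sender named a location the receiver cannot resolve.
                raise LookupError(f"no block at object {key[0]}, offset {key[1]}")
            written[(rec.obj, rec.offset)] = written[key]
    return written
```

In this model a reference is only usable when the sender can name an object number and offset that the receiver can resolve, which is exactly the information the sender lacks for blocks that were not part of the current stream.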

Lori
