Shachar-
 
True enough - with one additional thought: if the block size is set to the square root of the file size, then the load on the hash table effectively becomes dynamic on its own (a bigger block size means fewer block table entries, which means fewer hash collisions).
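To put rough numbers on that, here is a quick sketch in C (the 65536-bucket table and the sample file sizes are just for illustration):

#include <math.h>
#include <stdio.h>

/* Sketch: with block_size = sqrt(file_size), the number of block table
 * entries is file_size / block_size = sqrt(file_size), so the load on a
 * fixed 65536-bucket table grows only with the square root of the file
 * size instead of linearly. Table size and file sizes are illustrative. */
int main(void)
{
    const double buckets = 65536.0;
    const double sizes[] = { 100e3, 1e9, 500e9 };   /* 100KB, 1GB, 500GB */

    for (int i = 0; i < 3; i++) {
        double block_size = sqrt(sizes[i]);
        double entries    = sizes[i] / block_size;  /* == sqrt(file size) */
        printf("file %12.0f  block %8.0f  entries %8.0f  entries/bucket %6.2f\n",
               sizes[i], block_size, entries, entries / buckets);
    }
    return 0;
}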
 
In the case of relatively low bandwidth connections, you will get a MUCH bigger performance improvement by adjusting the block size than by adjusting the size of the hash table. The hash table isn't sent over the wire, but the block table IS, so reducing its size can have a big impact on performance if your file isn't changing much.
 
In Andrew's original thesis, he looked at several very large files and found that the combined size of transferring the block table AND the changes winds up being pretty constant - but there was an optimization to be had (the original block size of 700 came from this analysis).  There is a theoretical calculation (with quite a bit of hand waving) indicating that the optimal block size is the square root of the overall file size, but I haven't seen any parametric studies that confirm this.  You may need to fiddle with the block size and see what impact it has.  The problem you face, of course, is that running a parametric test with a file that big could be pretty slow.  I have done a bunch of parametric testing, but I use my own implementation of the algorithm, so I don't have to actually send all the data over the wire, which speeds up my tests.
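For what it's worth, the hand-waving calculation looks roughly like this - a sketch only, where the ~20 bytes per block table entry and the number of scattered changes are assumptions of mine, not numbers from the thesis:

#include <math.h>
#include <stdio.h>

/* Sketch of the block-size trade-off. Assumptions (mine, not from the
 * thesis): each block table entry costs sig_bytes on the wire, and each
 * isolated change forces roughly one whole block to be resent. */
int main(void)
{
    const double file_size = 500e9;   /* 500GB, as in Lars' case           */
    const double sig_bytes = 20.0;    /* assumed per-entry signature cost  */
    const double n_changes = 1000.0;  /* assumed number of scattered edits */

    for (double B = 1024; B <= 16.0 * 1024 * 1024; B *= 4) {
        double block_table = (file_size / B) * sig_bytes;  /* sent every run */
        double resent_data = n_changes * B;                /* literal data   */
        printf("block %9.0f  table %13.0f  data %13.0f  total %13.0f\n",
               B, block_table, resent_data, block_table + resent_data);
    }

    /* d(total)/dB = 0  =>  B = sqrt(file_size * sig_bytes / n_changes),
     * i.e. the optimum grows with the square root of the file size.      */
    printf("optimum around %.0f bytes\n", sqrt(file_size * sig_bytes / n_changes));
    return 0;
}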
 
You may be able to piggyback on another thread in this listserv - they are discussing adding a flag that tells rsync to create the block table and save it to a file, and another flag that takes that saved block table and computes a delta, which it also saves to a file.  If you run these locally, you may be able to get parametric results for your large file sizes without it taking forever.
 
Trying to increase the size of the hash table may just not be worth it - are you certain that the performance hit you are experiencing is caused by processing on the recipient side, and not by transferring the block table?  In my testing (which uses my own implementation of the algorithm, so it may have optimizations, or lack thereof, compared to the rsync you are running), the block table transfer is the biggest contributor to elapsed time for big files that don't change much.
 
It may be that using the square root of the file size as the block size just isn't appropriate for files as big as the ones you are working with...
 
- K
 
 
 
Original Message
--------------------
> Kevin Day wrote:

> I would *strongly* recommend that you dig into the thesis a bit (just
> the section that describes the rsync algorithm itself).

I tried a few weeks ago. I started to print it, and my printer ran out
of ink :-). I will read it electronically eventually (I hope).

> Now, if you have huge files, then the 16 bit checksum may not be
> sufficient to keep the hit rate down.  At that point, you may want to
> consider a few algorithmic alternatives:
>
> 1.  Break up your file into logical blocks and rsync each block
> individually.  If the file is an "append only" file, then this is
> fine.  However, if the contents of the file get re-ordered across
> block boundaries, then the efficiency of the rsync algorithm would be
> seriously degraded.
>
> 2.  Use a larger hash table.  Instead of 16 bits, expand it to 20 bits
> - it will require 16 times as much memory for the hash table, but that
> may not be an issue for you - you are probably working with some
> relatively beefy hardware anyway, so what the heck.

Now, here's the part where I don't get it. We have X checksummed blocks
covering Y bytes each (actually, with the square-root rule we have X
blocks of X bytes each, but that's not important). This means we actually
know, before we get the list of checksums, how many there will be.

As for the hash table size - that's standard engineering. The load
factor, alpha, is the ratio of the number of entries stored to the total
number of buckets; around 0.8 is considered a good value.

What I can propose is to make the hash table size a function of X. If we
take Lars' case, he has a 500GB file, which at a square-root block size
means roughly 700 thousand blocks, so you ideally need about 1 million
buckets in the hash table for reasonable performance. We only have 65
thousand, which puts his alpha at around 11 - more than ten checksums per
bucket on average. No wonder he is getting abysmal performance.

On the other hand, if I'm syncing a 100K file, I'm only going to have
about 320 blocks. A good hash table size for me would be around 400
buckets. Having 65536 buckets instead means I'm less likely to get memory
cache hits, and performance suffers again. My alpha is about 0.005
instead of 0.8 - roughly 200 buckets per block.

If my proposal is accepted, we become adaptive in the CPU-memory trade-off.
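In code, the sizing rule I have in mind is roughly this (the 0.8 target and the power-of-two rounding are just one possible choice, not a patch against rsync):

#include <stdint.h>
#include <stdio.h>

/* Sketch: size the hash table from the known block count X, aiming for a
 * load factor of about 0.8 and rounding up to a power of two so a cheap
 * mask can replace the modulo. Illustration only, not rsync code.        */
static uint32_t table_size_for(uint32_t num_blocks)
{
    uint32_t want = (uint32_t)(num_blocks / 0.8) + 1;
    uint32_t size = 16;                    /* arbitrary floor */
    while (size < want)
        size <<= 1;
    return size;
}

int main(void)
{
    printf("320 blocks    -> %u buckets\n", table_size_for(320));    /* 512     */
    printf("707000 blocks -> %u buckets\n", table_size_for(707000)); /* 1048576 */
    return 0;
}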

>  I'll leave the exercise of converting the full rsync 32 bit rolling
> checksum into a 20 bit value to you.

A simple modulo ought to do it. If the checksum is 42891 and I have 320
buckets, it goes into bucket 11 (42891 mod 320 = 11). Assuming the
checksums are sufficiently random-like, that is good enough.
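Something like this, for illustration only (I am not claiming this is how rsync maps the rolling checksum into its table today):

#include <stdint.h>
#include <stdio.h>

/* Reduce a 32-bit rolling checksum to a bucket index by simple modulo.
 * Illustration only; assumes the checksums are spread roughly uniformly. */
static uint32_t bucket_of(uint32_t checksum, uint32_t num_buckets)
{
    return checksum % num_buckets;
}

int main(void)
{
    printf("%u\n", bucket_of(42891, 320));  /* prints 11, as above */
    return 0;
}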

> Cheers,
>
> Kevin

      Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html

-- 
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
