Shachar-
True enough - with one additional thought: if the
block size is set to the square root of the file
size, then the load factor on the hash table becomes
dynamic in and of itself (bigger block size =
fewer master table entries = fewer hash collisions).
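To put rough numbers on that (my own back-of-the-envelope
figures, not anything measured): a 1 GB file with the
default 700-byte blocks means roughly 1.5 million block
entries to hash, while a square-root-sized block of about
32 KB means only about 33,000 entries - better than a 40x
drop in hash table load just from changing the block size.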
In the case of relatively low bandwidth connections,
you will get a MUCH better performance improvement
by messing with the block size than with the size of
the hash table, because the hash table isn't sent
over the wire - the block table IS sent over the
wire, so reducing its size can have a big impact on
performance if your file isn't changing much.
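As a rough illustration rather than exact protocol numbers:
figure each block entry costs on the order of 20 bytes on
the wire (a 4-byte rolling checksum plus a strong checksum
of up to 16 bytes). For a 1 GB file, 700-byte blocks put
roughly 30 MB of block table on the wire before any file
data moves; 32 KB blocks cut that to well under 1 MB.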
In Andrew's original thesis,
he looked at several very large files, and
found that the combined size of transferring
the block table AND the changes winds up
being pretty constant - but there was an
optimization that could be performed (the
original block size of 700 came from this
analysis). There is some theoretical
calculation (with quite a bit of hand-waving)
that indicates that the optimal block size
is the square root of the overall file size
- but I haven't seen any parametric studies
that confirm this. It may be that you
need to fiddle with the block size and see
what impact it has. The problem you
face, of course, is that running a parametric
test with a file that big could be pretty
slow. I have done a bunch of parametric
testing, but I use my own implementation
of the algorithm so I don't have to actually
send all the data over the wire, etc... which
speeds up my tests.
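For what it's worth, here is the hand-waving calculation as
I understand it (my own sketch with made-up parameter values,
not anything out of the thesis): if S is the file size, B the
block size, e the bytes of checksum per block, and c the
number of changed blocks, the combined overhead is roughly
(S/B)*e for the block table plus c*B for the literal data.
That is minimized at B = sqrt(S*e/c), so if you treat e and c
as constants you get B proportional to sqrt(S). A quick scan
in Python shows the shape:

import math

def overhead(S, B, e=20, c=100):
    # Hand-waving cost model: block table sent one way plus
    # literal data for c changed blocks sent back the other way.
    # e and c are guesses, not measured values.
    return (S / B) * e + c * B

S = 1 << 30                                    # 1 GB file
best = min(range(512, 1 << 20, 512), key=lambda B: overhead(S, B))
print("scan minimum:   ", best)                # about 14.5 KB
print("sqrt(S * e / c):", int(math.sqrt(S * 20 / 100)))

Even with these made-up numbers the curve is quite shallow
around the minimum - being off by a factor of two on block
size only costs on the order of 25% extra overhead.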
You may be able to piggyback off another
thread in this listserv - they are discussing
adding a flag that tells rsync to create the
block table and save it to a file, and another
that uses that saved block table and computes
a delta that it then saves to a file. If you
run these locally, you may be able to get some
parametric results on your large file sizes
without it taking forever.
Trying to increase the
size of the hash table may just not be worth
it - are you certain that the performance
hit you are experiencing is caused by processing
on the recipient side, and not by the transfer
of the block table? In my testing (which
is actually with my own implementation of
the algorithm, so I may have optimizations, or
lack thereof, compared to the rsync you
are running), the block table transfer is
the biggest cause of elapsed time for big
files that don't change much.
It may be that the square
root of file size for block size isn't appropriate
for files as big as you are working with...
- K