re[2]: Rsyncing really large files

2005-03-09 Thread Kevin Day
Shachar- I think Wayne is mostly pointing you to the correct location here. If you look at the code where the checksum is computed, you'll find that the rolling checksum actually consists of two parts - one is the sum of all of the byte values in the window, the other is the offset
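
For reference, a minimal sketch in C of the two-part rolling checksum being described (names and masking details are illustrative, following the rsync technical report's description rather than rsync's actual source):

    /* s1 is the plain sum of the byte values in the window; s2 weights
     * each byte by its distance from the end of the window.  The
     * combined 32-bit weak checksum packs both 16-bit halves. */
    #include <stdint.h>
    #include <stddef.h>

    static uint32_t weak_checksum(const unsigned char *buf, size_t len)
    {
        uint32_t s1 = 0, s2 = 0;
        for (size_t i = 0; i < len; i++) {
            s1 += buf[i];
            s2 += (uint32_t)(len - i) * buf[i];
        }
        return (s1 & 0xffff) | ((s2 & 0xffff) << 16);
    }

    /* Rolling update: slide the window one byte to the right by
     * removing the outgoing byte `out` and adding the incoming `in`. */
    static void roll(uint32_t *s1, uint32_t *s2, size_t len,
                     unsigned char out, unsigned char in)
    {
        *s1 = *s1 - out + in;
        *s2 = *s2 - (uint32_t)len * out + *s1;
    }

The point of the split is that both halves can be updated in O(1) when the window slides one byte, which is what makes checking every byte offset affordable.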

Re: Rsyncing really large files

2005-03-09 Thread Shachar Shemesh
Kevin Day wrote: Shachar- I think Wayne is mostly pointing you to the correct location here. If you look at the code where the checksum is computed, you'll find that the rolling checksum actually consists of two parts - one is the sum of all of the byte values in the window, the other is the

Re: Rsyncing really large files

2005-03-05 Thread Shachar Shemesh
Lars Karlslund wrote: And I'm suggesting making it static, by adjusting the hash table's size according to the number of blocks. Just do hashtablesize=(numblocks/8+1)*10;, and you should be set. Or maybe it should really be dynamic. I'm talking about the hash table load. I.e. - the ratio
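
As a toy illustration of the sizing rule quoted above (the formula is from the message; the surrounding code is mine, not rsync's):

    #include <stdio.h>

    int main(void)
    {
        long numblocks = 1000000;                   /* example: 1M blocks */
        long hashtablesize = (numblocks / 8 + 1) * 10;
        printf("table size %ld, load factor %.2f\n",
               hashtablesize, (double)numblocks / hashtablesize);
        return 0;
    }

Sizing the table this way keeps it about 25% larger than the number of blocks, so the load factor stays near 0.8 regardless of file size.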

Re: Rsyncing really large files

2005-03-05 Thread Shachar Shemesh
Kevin Day wrote: As a quick FYI, the block size absolutely has an impact on the effectiveness of the checksum table - larger blocks mean fewer blocks, which means fewer hash collisions. Since you wouldn't expect that many blocks in the file, a 32bit weak checksum would only produce about 4 or

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
On Sat, Mar 05, 2005 at 02:18:01PM +0200, Shachar Shemesh wrote: That's not what I read into it. It seems to me that the checksum function gives a 32bit result, and we are squashing that into a 16bit hash table. Can you point me to the code? Wayne? Look in match.c. The hash table is

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
On Thu, Mar 03, 2005 at 10:18:01AM +0200, Shachar Shemesh wrote: And I'm suggesting making it static, by adjusting the hash table's size according to the number of blocks. The block-size? No thanks. The smaller the block-size for a particular file size, the more checksum data needs to be

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
[I accidentally sent this reply only to Lars instead of the list.] On Fri, Mar 04, 2005 at 06:44:48PM +0100, Lars Karlslund wrote: I read somewhere about the default blocksize of 700 bytes. Now I'm told the blocksize is calculated automatically. Which is it? 700 is the minimum block-size. The
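
A minimal sketch of that heuristic, assuming a simple square-root rule with a 700-byte floor (the exact rounding and upper bound live in rsync's source and differ by version):

    #include <math.h>
    #include <stdint.h>

    #define MIN_BLOCK_SIZE 700

    /* Block size grows with the square root of the file size, but
     * never drops below the 700-byte minimum mentioned above. */
    static int32_t choose_block_size(int64_t file_len)
    {
        int32_t blength = (int32_t)sqrt((double)file_len);
        return blength < MIN_BLOCK_SIZE ? MIN_BLOCK_SIZE : blength;
    }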

Re: Rsyncing really large files

2005-03-05 Thread Shachar Shemesh
Wayne Davison wrote: On Thu, Mar 03, 2005 at 10:18:01AM +0200, Shachar Shemesh wrote: And I'm suggesting making it static, by adjusting the hash table's size according to the number of blocks. The block-size? Definitely not! I was talking about the hash table load. I.e. - the ratio

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
Wayne Davison wrote: He's talking about potentially losing an even distribution of values if the lowest order bits aren't random enough. On Sat, Mar 05, 2005 at 08:10:37PM +0200, Shachar Shemesh wrote: Even for N which is not a power of 2? In the case of the weak checksum, it is computed
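
The distribution concern can be seen in a small illustration (mine, not code from rsync): reducing a checksum to a bucket index with a power-of-two mask uses only the low-order bits, while a modulo by a non-power-of-two table size mixes the higher bits into the result.

    #include <stdint.h>

    static uint32_t bucket_mask(uint32_t sum, uint32_t size_pow2)
    {
        return sum & (size_pow2 - 1);   /* keeps only the low bits */
    }

    static uint32_t bucket_mod(uint32_t sum, uint32_t size)
    {
        return sum % size;              /* result depends on all bits */
    }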

re[2]: Rsyncing really large files

2005-03-01 Thread Kevin Day
Shachar- True enough - with one additional thought - if the block size is set to be the square root of the file size, then the load factor on the hash table becomes dynamic in and of itself (bigger block size = fewer master table entries = fewer hash collisions). In the case of
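
A worked example of that effect (my numbers, not from the thread): with block_size roughly sqrt(file_size), the block count is also roughly sqrt(file_size), so a 65536-entry lookup table only reaches a load factor of about 1.0 around a 4 GB file.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double sizes[] = { 1e6, 1e9, 4e9, 1e12 };    /* 1 MB .. 1 TB */
        for (int i = 0; i < 4; i++) {
            double block  = sqrt(sizes[i]);
            double blocks = sizes[i] / block;        /* == sqrt(size) */
            printf("file %.0e B: block %.0f B, %.0f blocks, load %.2f\n",
                   sizes[i], block, blocks, blocks / 65536.0);
        }
        return 0;
    }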

Re: Rsyncing really large files

2005-02-28 Thread Lars Karlslund
Hi everyone, Thank you for your replies. On Thu, 2005-02-24 at 17:59 +0100, Paul Slootman wrote: It would certainly be possible to change the algorithm to not cache the data (and thus only allow the current block to be compared), but I don't think that idea has general enough interest

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Lars Karlslund wrote: Also as far as I could read, the default block size is 700 bytes? What kind of application would default to moving data around 700 bytes at a time internally in a file? I'm not criticizing rsync, merely questioning the functionality of this feature. I believe you may have

Re: Rsyncing really large files

2005-02-28 Thread Lars Karlslund
On Mon, 2005-02-28 at 12:23 +0200, Shachar Shemesh wrote: Also as far as I could read, the default block size is 700 bytes? What kind of application would default to moving data around 700 bytes at a time internally in a file? I'm not criticizing rsync, merely questioning the

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Lars Karlslund wrote: Maybe I didn't express myself thoroughly enough :-) Or me. Yes, a block is a minimum storage unit, which is considered for transfer. In size, yes. Not in position. But it's a fact that the rsync algorithm as it is now checks to see if a block should have moved. And in that

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Shachar Shemesh wrote: No, because the rsync algorithm can detect single byte moves of this 700 bytes block. I will just mention that I opened the ultimate documentation for rsync (the source), and it says that the default block size is the rounded square root of the file's size. This means

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Wayne Davison wrote: However, you should be sure to have measured what is causing the slowdown first to know how much that will help. If it is not memory that is swapping on the sender, it may be that the computing of the checksums is maxing out your CPU, and removing the caching of the remote

re[2]: Rsyncing really large files

2005-02-28 Thread Kevin Day
Shachar, It does use a hash table. rsync adds the two components of the rolling checksum together to come up with a 16 bit hash, and performs a table lookup. The table contains an offset into the sorted rsync checksum table (which contains both weak and strong checksums), and does a

Re: Rsyncing really large files

2005-02-28 Thread Wayne Davison
On Mon, Feb 28, 2005 at 08:33:52PM +0200, Shachar Shemesh wrote: If so, we can probably make it much much (much much much) more efficient by using a hash table instead. That's what tag_table is -- it's an array of 65536 pointers into the checksum data sorted by weak checksum. The code then
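
A simplified sketch of that lookup scheme (the real code is in rsync's match.c; the fold-to-16-bits step and the field names here are my reading of the descriptions above, not a copy of the source):

    #include <stdint.h>

    struct block_sum {
        uint32_t      weak;          /* 32-bit rolling checksum */
        unsigned char strong[16];    /* strong (MD4) block checksum */
    };

    /* Fold the 32-bit weak checksum to a 16-bit tag by adding its
     * two halves, as described above. */
    static uint16_t gettag(uint32_t weak)
    {
        return (uint16_t)((weak & 0xffff) + (weak >> 16));
    }

    /* tag_table[gettag(weak)] holds the index of the first entry in
     * the block list (sorted by weak checksum) with that tag, or -1 if
     * no block has that tag.  The matcher scans forward from there,
     * comparing full weak checksums and only then the strong checksum. */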

Re: Rsyncing really large files

2005-02-28 Thread Craig Barratt
Lars Karlslund writes: Also the numbers speak for themselves, as the --whole-file option is *way* faster than the block-copy method on our setup. At the risk of jumping into the middle of this thread without remembering everything that was discussed... Remember that by default rsync writes a

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Kevin Day wrote: I would *strongly* recommend that you dig into the thesis a bit (just the section that describes the rsync algorithm itself). I tried a few weeks ago. I started to print it, and my printer ran out of ink :-). I will read it electronically eventually (I hope). Now, if you have

Re: Rsyncing really large files

2005-02-24 Thread Paul Slootman
On Thu 24 Feb 2005, Wayne Davison wrote: It would certainly be possible to change the algorithm to not cache the data (and thus only allow the current block to be compared), but I don't think that idea has general enough interest for me to work on for inclusion in rsync. You might want to