On 08/21/2014 08:10 AM, Bernhard Voelker wrote:
> On 08/11/2014 03:55 PM, Rasmus Borup Hansen wrote:
>> Trusting that resizing the hash table would eventually finish, the cp
>> command was allowed to continue, and after a while it started copying
>> again. It stopped again and resized the hash table a couple of times,
>> each taking more and more time. Finally, after 10 days of copying and
>> hash table resizing, the new file system used as many blocks and inodes
>> as the old one according to df, but to my surprise the cp command didn't
>> exit. Looking at the source again, I found that cp disassembles its hash
>> table data structures nicely after copying (the forget_all call). Since
>> the virtual size of the cp process was now more than 17 GB and the
>> server only had 10 GB of RAM, it did a lot of swapping.
>
> Thinking about this case again, I find this very surprising:
>
> a) that cp(1) uses 17 GB of memory when copying 39 TB of data.
>    That means roughly 2300 bytes per file:
>
>      $ bc <<<'39 * 1024 / 17'
>      2349
>
>    ... although the hashed structure only has these members:
>
>      struct Src_to_dest
>      {
>        ino_t st_ino;
>        dev_t st_dev;
>        char *name;
>      };
>
>    I think either the file names were rather long (on average!),
>    or there is something wrong in the code.
>
> b) that cp(1) is increasing the hash table that often.
>    This is because it uses the default Hash_tuning (hash.c):
>
>      /* [...] The growth threshold defaults to 0.8, and the growth factor
>         defaults to 1.414, meaning that the table will have doubled its size
>         every second time 80% of the buckets get used.  */
>      #define DEFAULT_GROWTH_THRESHOLD 0.8f
>      #define DEFAULT_GROWTH_FACTOR 1.414f
>
>    It has been like this since hashing was introduced, and
>    I wonder if cp(1) couldn't use better values for this.
>
> Have a nice day,
> Berny
The number of files rather than the amount of data is pertinent
here. So 17 GB / 432 M files is about 40 bytes per entry, which is
about right.
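To make the arithmetic concrete, here is a back-of-the-envelope
sketch (a 64-bit platform is assumed; struct hash_entry is the
chaining node from gnulib's hash.c, and each stored entry occupies
one of these, either a slot in the bucket array or a malloc'd
overflow node):

    #include <stdio.h>
    #include <sys/types.h>

    /* The per-file record cp keeps (cp-hash.c).  */
    struct Src_to_dest { ino_t st_ino; dev_t st_dev; char *name; };

    /* gnulib's hash.c threads each entry through one of these.  */
    struct hash_entry { void *data; struct hash_entry *next; };

    int
    main (void)
    {
      size_t rec = sizeof (struct Src_to_dest);   /* 24 on 64 bit */
      size_t node = sizeof (struct hash_entry);   /* 16 on 64 bit */
      printf ("fixed cost per entry: %zu bytes\n", rec + node);
      return 0;
    }

That prints 40 on a typical 64-bit system, and 40 bytes * 432 M
entries is already about 17 GB before counting the name strings and
malloc bookkeeping, so the observed footprint is no surprise.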
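As for (b), the tuning is per table: hash_initialize() takes a
Hash_tuning argument, and cp uses the defaults (a NULL tuning). A
minimal sketch of what overriding it could look like follows; the
0.9/4.0 values are illustrative, not a tested recommendation, and the
two callbacks are placeholders for the real ones in cp-hash.c:

    #include <stdbool.h>
    #include <stdio.h>
    #include "hash.h"   /* gnulib */

    static size_t
    hash_cb (void const *x, size_t table_size)
    {
      return (size_t) x % table_size;   /* placeholder hash */
    }

    static bool
    compare_cb (void const *x, void const *y)
    {
      return x == y;                    /* placeholder comparison */
    }

    int
    main (void)
    {
      static Hash_tuning const tuning =
        {
          0.0f,   /* shrink_threshold: never shrink (the default) */
          1.0f,   /* shrink_factor (the default) */
          0.9f,   /* growth_threshold: let the table fill further */
          4.0f,   /* growth_factor: quadruple instead of *1.414 */
          false   /* is_n_buckets */
        };

      Hash_table *t = hash_initialize (103, &tuning,
                                       hash_cb, compare_cb, NULL);
      if (!t)
        return 1;   /* hash_initialize rejects invalid tunings */
      puts ("table initialized with custom tuning");
      hash_free (t);
      return 0;
    }

Whether such values would pay off in general is another question:
fewer rehashes on huge trees trade against a transiently larger
bucket array for everyone else.

cheers,
Pádraig.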