On Fri, Jun 12, 2009 at 12:54 AM, Ralf Wildenhues <[email protected]> wrote:
> * Jim Meyering wrote on Thu, May 28, 2009 at 09:33:21PM CEST:
> > Glen Lenker wrote:
> > > On Thu, Mar 26, 2009 at 09:50:08PM +0000, Ralf Wildenhues wrote:
> > >>
> > >> Example run, on an 8-way, and with cat'ed instances of the dictionary,
> > >> on tmpfs, timings best of three:
> > >
> > > Hey Ralf, did you happen to specify the amount of RAM sort should
> > > use?  Not specifying enough RAM for sort would break what would
> > > otherwise be a single internal sort into multiple internal sort
> > > passes and then an external sort.  As it is, the external sort is
> > > still sequential.
>
> No, I did not specify the amount of RAM.  The system I tested on has
> plenty of RAM, way more than is needed for the sort.  Specifying
> something like 2G of RAM does not make any visible difference.
>

When testing on gcc16 (16G of RAM), sort chooses to use only 1G of RAM
for a 256M file and ends up hitting the external merge code.  Through
experimentation with gprof I found that anything less than 2G causes
sort to do an external sort on gcc16.

> > >
> > > I ran these tests on a 256MB instance of the dictionary in tmpfs, on
> > > an 8-core machine specifying 2G of RAM.
> > >
> > > runtime [s]            threads
> > > file size [MB]      1          2          4        8
> > > 256              2m41.219   1m27.357    52.53    36.429
> >
> > 160.22user 1.34system 2:41.61elapsed 99%CPU
> > 159.83user 1.45system 1:27.12elapsed 185%CPU
> > 159.84user 1.56system 0:52.26elapsed 308%CPU
> > 160.67user 1.53system 0:36.26elapsed 447%CPU
> > >
> > > This seems to be what I would expect from a good implementation.
>
> Yes, 56% efficiency going from 1 to 8 threads sounds like a much better
> number; your system overhead also looks very sane compared to what I
> saw.  Seems like it's a system-specific issue after all.  Which Linux
> kernel version and which pthread (glibc) version are you using?
>

This was on gcc16.

Linux gcc16 2.6.18-6-vserver-amd64 #1 SMP Thu Dec 25 21:35:11 UTC 2008 x86_64 GNU/Linux
glibc 2.3.6.ds1-13etch9

> > > >> It suggests to me that too much time is spent busy-waiting in
> > > >> pthread_join, or that sort is computing too much (I haven't
> > > >> looked at the patch in detail).
> > > >>
> > > >> Also, I'd have expected the rate going from 1 to 2 threads to get
> > > >> at least a bit better with bigger file size, but it remains
> > > >> remarkably constant, around 1.45 for this setup.  What am I
> > > >> missing?
>
> Glen, could you look at this, too?  I mean just timings of 1 vs 2
> threads for different file sizes?  Thanks.
>

The data was again cat'ed instances of the dictionary, put onto tmpfs.
I made sure that it had enough memory each time.

1st run
memory     1 thread    2 threads    speedup
16M           7.27        4.04       ~1.8
32M          16.15        9.19       ~1.75
64M          35.06       19.61       ~1.78
128M         75.47       40.99       ~1.84
256M        162.30       87.27       ~1.86

2nd run
memory     1 thread    2 threads    speedup
16M           7.27        4.05       ~1.8
32M          16.15        8.86       ~1.82
64M          35.07       19.18       ~1.83
128M         75.22       41.28       ~1.82
256M        161.76       87.77       ~1.84

> > I ran some tests (see below), and got results that look
> > similar to Ralf's.  I used an otherwise-idle 16-core Nehalem-based
> > system with 8GB of RAM.
>
> Which kernel and glibc?
>
> > However, more importantly, while the 16- and 8-thread tests were
> > running, I sampled thread counts using ps and was surprised to see
> > that the number of active threads was usually far less than the maximum.
>
> Well yes, I suppose you are seeing the serial overhead from reading in
> the data set by the first process alone, and from the later merging
> stages where fewer threads are active.
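
For what it's worth, that squares with my 1- vs 2-thread numbers above.
Taking the 8-thread run at face value and assuming a simple Amdahl-style
model (a fixed serial fraction plus a perfectly parallel remainder, which
only approximates the read and merge phases), the measured speedup implies
a serial fraction of roughly 11-12%, and that same fraction predicts almost
exactly the ~1.8x I see with 2 threads.  A quick check:

/* Back-of-the-envelope Amdahl check.  Assumes a fixed serial fraction,
   which is only a rough model of sort's read + merge phases.  */
#include <stdio.h>

int
main (void)
{
  double t1 = 161.219;          /* 1-thread wall clock [s] (2m41.219) */
  double t8 = 36.429;           /* 8-thread wall clock [s] */
  double s8 = t1 / t8;          /* measured speedup at 8 threads (~4.4x) */

  /* Amdahl: speedup(N) = 1 / (f + (1 - f) / N); solve for f at N = 8.  */
  double f = (1.0 / s8 - 1.0 / 8) / (1.0 - 1.0 / 8);

  /* Predicted speedup at N = 2 with that serial fraction.  */
  double s2 = 1.0 / (f + (1.0 - f) / 2);

  printf ("8-thread speedup: %.2f (%.0f%% efficiency)\n", s8, 100 * s8 / 8);
  printf ("implied serial fraction: %.1f%%\n", 100 * f);
  printf ("predicted 2-thread speedup: %.2f\n", s2);
  return 0;
}

So, at least under that crude model, the numbers here look consistent with
serial read and merge overhead rather than, say, lock contention.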
>
> > T=16     T is the number of threads
> >  12.96   <--- This is elapsed wall-clock time in seconds.
> >  16.43
> >  13.76
> >  16.60
>
> That varies a lot.  Are there other jobs running on this system?
> Do you, Jim, also see high system time overhead?  That would support
> the hypothesis of a less efficient thread implementation (or kernel
> issue) being the cause.
>
> Off-list, Glen wrote:
> > I don't know if this is related, because up until now memory
> > consumption hasn't been mentioned wrt this patch, but this patch
> > bumps the amount of memory consumed from 1.5x to 2x.  Should we
> > consider reducing this?
>
> I don't think fighting over that 0.5 x size is worth the hassle,
> specifically as that x is just the number of lines times the size
> of a struct line, not the size of the data in one line (right?).
>

I was just concerned that the extra 0.5 x #lines x sizeof(struct line)
would push the amount of memory needed over the default amount of memory
allocated by sort, ultimately forcing an external merge where one would
not normally be necessary (see the P.S. below for some rough numbers).
I suppose that this shouldn't matter in the future as much as it matters
now.

On this subject, what exactly is the motivation of sort's
default_sort_size()?  Shouldn't it try to do an internal sort if
possible?

Glen
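
P.S.  To put a rough number on the overhead I was worried about: assuming
sizeof(struct line) is 32 bytes on x86_64 (four pointer-sized members) and
an average dictionary line of about 10 bytes including the newline, both
assumptions rather than measurements, the arithmetic for the 256M input
looks like this:

/* Rough estimate of the extra memory from the 1.5x -> 2x change,
   assuming sizeof (struct line) == 32 on x86_64 and ~10 bytes per
   dictionary line (both assumptions, not measurements).  */
#include <stdio.h>

int
main (void)
{
  double file_bytes = 256.0 * 1024 * 1024;  /* 256M of cat'ed dictionary */
  double avg_line = 10.0;                   /* assumed bytes per line */
  double line_struct = 32.0;                /* assumed sizeof (struct line) */

  double nlines = file_bytes / avg_line;
  double x = nlines * line_struct;          /* #lines * sizeof (struct line) */

  printf ("lines:        ~%.1fM\n", nlines / 1e6);
  printf ("x:            ~%.0f MB\n", x / 1048576);
  printf ("extra 0.5x:   ~%.0f MB\n", 0.5 * x / 1048576);
  printf ("data + 1.5x:  ~%.2f GB\n", (file_bytes + 1.5 * x) / 1073741824);
  printf ("data + 2x:    ~%.2f GB\n", (file_bytes + 2.0 * x) / 1073741824);
  return 0;
}

With those hand-wavy numbers the extra 0.5x is on the order of 400MB, and
if the in-core footprint is roughly the text plus that multiple of x, the
total lands right around the ~2G threshold I saw on gcc16, which would be
consistent with the default buffer size pushing it into the external merge.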

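P.P.S.  To partly answer my own question about default_sort_size(): as far
as I can tell its motivation is to pick a buffer that stays within resource
limits and a fraction of physical memory, so sort does not swap or trip
over rlimits, rather than to guarantee an internal sort.  Below is a
simplified, illustrative sketch of that kind of heuristic.  It is a
paraphrase, not the actual sort.c code (which I believe uses gnulib's
physmem functions rather than sysconf), and the specific fractions are
placeholders:

/* Illustrative sketch of a default_sort_size()-style heuristic:
   cap the buffer by resource limits and by a fraction of physical
   memory.  A paraphrase for discussion, not the coreutils code.  */
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

static size_t
guess_sort_size (void)
{
  size_t size = (size_t) -1;    /* start from "unlimited" */
  struct rlimit r;

  /* Never exceed the data segment / address space limits.  */
  if (getrlimit (RLIMIT_DATA, &r) == 0
      && r.rlim_cur != RLIM_INFINITY && r.rlim_cur < size)
    size = r.rlim_cur;
#ifdef RLIMIT_AS
  if (getrlimit (RLIMIT_AS, &r) == 0
      && r.rlim_cur != RLIM_INFINITY && r.rlim_cur < size)
    size = r.rlim_cur;
#endif

  /* Leave headroom rather than grabbing the whole allowance.  */
  size /= 2;

  /* Cap by a fraction of physical memory so sort does not swap.
     (sort.c uses gnulib physmem for this; sysconf keeps the
     example self-contained.  The 1/8 is an assumed fraction.)  */
  long pages = sysconf (_SC_PHYS_PAGES);
  long psize = sysconf (_SC_PAGESIZE);
  if (0 < pages && 0 < psize)
    {
      double total = (double) pages * psize;
      if (total / 8 < size)
        size = total / 8;
    }
  return size;
}

int
main (void)
{
  printf ("guessed sort buffer: ~%zu MB\n", guess_sort_size () / 1048576);
  return 0;
}

Whichever cap bites first wins, which is presumably how sort ends up at
around 1G on gcc16 even though the file itself fits comfortably in RAM.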