Re: Multi-threading in sort(or core-utils)
On 16th June 2008, Paul Eggert wrote:
> Pádraig Brady [EMAIL PROTECTED] writes:
>> 5. I think it would be nice for dd to support reading portions of a file efficiently. As far as I can see it can only do it by reading 1 byte at a time. Perhaps skip_bytes=x and count_bytes=x would be useful additions?
> These also sound like good suggestions.

I just noticed in the info docs for dd that there is a method to achieve skip_bytes at least:

  (dd bs=1 skip=123456 count=0; dd bs=8192) < file

Pádraig.

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
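The two-dd recipe from the info docs can be checked end to end. A minimal sketch (file name and offsets are arbitrary choices for the demo): the first dd, with count=0, copies nothing but seeks the shared stdin past the unwanted prefix; the second dd then reads the remainder with a sensible block size.

```shell
#!/bin/sh
# Demonstrate the info-docs recipe: skip an arbitrary number of bytes
# (here 10) efficiently, then read the rest in 8 KiB blocks.
printf 'abcdefghijklmnopqrstuvwxyz' > /tmp/dd_demo

# Both dd invocations share the same open file description on stdin,
# so the seek performed by the first is seen by the second.
(dd bs=1 skip=10 count=0; dd bs=8192) < /tmp/dd_demo 2>/dev/null
```

The output is the file's contents from byte 10 onward, the same as `tail -c +11 /tmp/dd_demo`, but without reading the skipped region one byte at a time.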
Re: Multi-threading in sort(or core-utils)
James Youngman wrote:
> On Thu, Jun 26, 2008 at 1:22 AM, Bo Borgerson [EMAIL PROTECTED] wrote:
>> If all inputs are regular files then SORTERS read directly rather than being fed by an extra process.
> Does that work with multi-byte character sets?

Hi James,

Each sorter's portion of input is delineated along line boundaries as detected by the main buffer-filling routine. I don't think any multi-byte character set problems should have been introduced.

What type of issue specifically concerns you? I'm going to start putting together some tests soon, and I'd like to include a test case that would exercise the type of bug you have in mind.

Thanks,
Bo
Re: Multi-threading in sort(or core-utils)
On Sat, Jun 28, 2008 at 4:27 PM, Bo Borgerson [EMAIL PROTECTED] wrote:
> Hi James,
> Each sorter's portion of input is delineated along line boundaries as detected by the main buffer-filling routine. I don't think any multi-byte character set problems should have been introduced.

I see. I agree, then.

> What type of issue specifically concerns you? I'm going to start putting together some tests soon, and I'd like to include a test case that would exercise the type of bug you have in mind.

Given that you had done away with the feeder, I had assumed that you'd just set things up so that for N sort workers on an M byte input, worker i would start at the next line boundary after byte offset (i*M/N). I was essentially wondering how you could accurately determine the actual line boundaries in a multibyte file. But I now understand that you are not doing things that way.

James.
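The offset-based scheme James describes (which, as he notes, is not what the patch actually does) can be sketched from the shell. This is purely illustrative; the file name and worker count are arbitrary, and real line boundaries are found by scanning forward from each computed byte offset to the next newline.

```shell
#!/bin/sh
# Hypothetical sketch of "worker i starts at the next line boundary
# after byte offset i*M/N" for N workers over an M-byte file.
printf 'aaa\nbbb\nccc\nddd\n' > /tmp/boundary_demo
file=/tmp/boundary_demo
N=2
M=$(wc -c < "$file")

i=0
while [ "$i" -lt "$N" ]; do
  off=$((i * M / N))
  if [ "$off" -eq 0 ]; then
    start=0
  else
    # Count bytes from $off up to and including the next newline;
    # the worker starts just past that newline.
    skip=$(tail -c +$((off + 1)) "$file" | sed q | wc -c)
    start=$((off + skip))
  fi
  echo "worker $i starts at byte $start"
  i=$((i + 1))
done
```

For the 16-byte demo file with N=2, worker 0 starts at byte 0 and worker 1 at byte 12 (the start of the "ddd" line, the first line boundary after offset 8). A byte-offset scan like this finds newline *bytes*, which is exactly why multibyte encodings where 0x0A can appear inside a character would be problematic.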
Re: Multi-threading in sort(or core-utils)
Bo Borgerson wrote:
> Cons:
> - It's limited by the feeder. If the sorters were able to read at their own pace I think this would scale better.
> - It uses N+2 processes. When sorters are run in parallel there are two helper processes, one feeding input and one merging output.

Hello again.

In light of the drawbacks mentioned for my previous parallel sort patch, I've made the attached modifications. If all inputs are regular files then SORTERS read directly rather than being fed by an extra process.

This exhibits better performance with a concurrency of 2, but still does not realize the full benefit of greater concurrency that I was expecting:

  $ for i in 0 1 2 3; do cat /dev/urandom | base64 | head -200 | cut -da -f1 > p$i; done
  $ time ~/sort p0 p1 p2 p3 > /dev/null
  real    0m31.444s
  $ time ~/sort --concurrency=2 p0 p1 p2 p3 > /dev/null
  real    0m16.908s
  $ time ~/sort --concurrency=4 p0 p1 p2 p3 > /dev/null
  real    0m15.353s
  $ time ~/sort -m <(~/sort p0 p1) <(~/sort p2 p3) > /dev/null
  real    0m17.066s   -- similar to --concurrency=2
  $ time ~/sort -m <(~/sort p0) <(~/sort p1) <(~/sort p2) <(~/sort p3) > /dev/null
  real    0m10.832s   -- _this_ is what I want from --concurrency=4!

Jim pointed out a mistake in my performance testing script that was causing me to use a smaller sample size for each data point than I intended. I've attached a version with his patch.

Thanks,
Bo

[attachment: bench_output_server_2.png]

From 298d6871aee1a1a506015681fb88ce5ba2b24644 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Tue, 24 Jun 2008 14:02:22 -0400
Subject: [PATCH] sort: Don't use a feeder for regular file concurrency.

* src/sort.c (xlseek): Try to lseek, complain on failure. Stolen from src/tail.c.
(fillbuf): If SORTER_BYTES_LEFT is non-negative, treat it as a limit.
(sort): When running multiple sorters concurrently, if all inputs are regular files set sorters up to read directly rather than spawning a feeder to distribute work among them.
---
 src/sort.c |  220 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 194 insertions(+), 26 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 18a8882..4aa7609 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -307,6 +307,11 @@ static unsigned int nmerge = NMERGE_DEFAULT;
    their output through a pipe to the parent who will merge.  */
 static int sorter_output_fd = -1;
 
+/* If multiple sorters are each reading their own input rather
+   than being fed by a single process then they'll have a cap
+   on how much they can read.  */
+static size_t sorter_bytes_left = -1;
+
 static void sortlines_temp (struct line *, size_t, struct line *);
 
 /* Report MESSAGE for FILE, then clean up and exit.
@@ -831,6 +836,45 @@ xfclose (FILE *fp, char const *file)
     }
 }
 
+/* Call lseek with the specified arguments, where file descriptor FD
+   corresponds to the file, FILENAME.
+   Give a diagnostic and exit nonzero if lseek fails.
+   Otherwise, return the resulting offset.
+
+   This is stolen from src/tail.c */
+
+static off_t
+xlseek (int fd, off_t offset, int whence, char const *filename)
+{
+  off_t new_offset = lseek (fd, offset, whence);
+  char buf[INT_BUFSIZE_BOUND (off_t)];
+  char *s;
+
+  if (0 <= new_offset)
+    return new_offset;
+
+  s = offtostr (offset, buf);
+  switch (whence)
+    {
+    case SEEK_SET:
+      error (0, errno, _("%s: cannot seek to offset %s"),
+             filename, s);
+      break;
+    case SEEK_CUR:
+      error (0, errno, _("%s: cannot seek to relative offset %s"),
+             filename, s);
+      break;
+    case SEEK_END:
+      error (0, errno, _("%s: cannot seek to end-relative offset %s"),
+             filename, s);
+      break;
+    default:
+      abort ();
+    }
+
+  exit (EXIT_FAILURE);
+}
+
 static void
 dup2_or_die (int oldfd, int newfd)
 {
@@ -1556,7 +1600,8 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
   size_t line_bytes = buf->line_bytes;
   size_t mergesize = merge_buffer_size - MIN_MERGE_BUFFER_SIZE;
 
-  if (buf->eof)
+
+  if (buf->eof || 0 == sorter_bytes_left)
     return false;
 
   if (buf->used != buf->left)
@@ -1582,9 +1627,23 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
         rest of the input file consists entirely of newlines,
         except that the last byte is not a newline.  */
       size_t readsize = (avail - 1) / (line_bytes + 1);
-      size_t bytes_read = fread (ptr, 1, readsize, fp);
-      char *ptrlim = ptr + bytes_read;
+      size_t bytes_read;
+      char *ptrlim;
       char *p;
+
+      if (0 < sorter_bytes_left)
+        readsize = MIN (readsize, sorter_bytes_left);
+
+      bytes_read = fread (ptr, 1, readsize, fp);
+
+      if (0 < sorter_bytes_left)
+        sorter_bytes_left -= bytes_read;
+
+      /* This is a fake end-of-file for this sorter child. */
+      if (0 == sorter_bytes_left)
+        buf->eof = true;
+
+      ptrlim = ptr + bytes_read;
       avail -= bytes_read;
 
       if (bytes_read != readsize)
@@ -2627,6 +2686,9 @@ sort (char * const *incoming_files, size_t nfiles,
Re: Multi-threading in sort(or core-utils)
Paul Eggert wrote:
> Bo Borgerson [EMAIL PROTECTED] writes:
>> Does this sound like a step in the right direction for sort? If I were to clean this up and submit it would you be willing to assess its viability as a portable improvement?
> Yes, and yes. And thanks!

Hi Paul,

When I isolated my parallel merge enhancement I discovered that the improvement I was seeing was mostly the result of my not having properly divided resources (particularly NMERGE) among children. Aside from some benefit to unique merges with many duplicates, I wasn't able to produce satisfactory results using this approach.

So I started from scratch on a parallel bulk sort enhancement. Here I was able to see some modest but reliable improvement.

The approach I took was to divide the main work among a number of children (sorters) by using an additional child (the feeder) to read inputs and distribute data among them. The parent then merges sorter output. This approach has some pros and cons in my view:

Pros:
- It's simple. Sorters don't need to worry about what their siblings are doing. They just process the data they're fed.
- It doesn't require a known amount of data. Work is distributed among sorters by the feeder in small interleaved chunks.
- It doesn't require regular files. Data coming through FIFOs or from sub-pipelines via process substitution is no problem.
- It limits increased resource consumption. The feeder is the only process reading from disk until/unless the sorters need temporary files. NMERGE and SORT_BUFFER_SIZE are divided among sorters.

Cons:
- It's limited by the feeder. If the sorters were able to read at their own pace I think this would scale better.
- It uses N+2 processes. When sorters are run in parallel there are two helper processes, one feeding input and one merging output.

I've attached the results of some performance testing I did both on my laptop (dual core) and on a server (4x dual core, hyper-threaded == 16 processors visible).
I included two graphs of the server results which I thought were interesting. Each line represents a level of concurrency. One graph shows time in seconds on the Y axis while the other shows percentage of single-process time. In both cases lower lines indicate better performance. As you can see, even on the 16 processor machine performance peaks at a low concurrency.

I included the script I used for testing in case anyone else is interested and has a machine they're willing to run hot for a few hours.

The attached patch is also available for fetch from git://repo.or.cz/coreutils/bo.git as branch 'sort'.

I haven't included any tests or documentation in the patch yet. I was hoping to first get a sense of whether you and other more experienced coreutils developers consider this alternate approach to be worth pursuing.

Thanks,
Bo

[attachment: bench_output_server_seconds.png]
[attachment: bench_output_server_percentage.png]

From e31f3f11a2d06079182ae7892e3af280dc4044cc Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 18 Jun 2008 09:59:46 -0400
Subject: [PATCH] sort: Add new option --concurrency=N.

* src/sort.c (xfopen): Take an additional argument, FD. If FILE is NULL then fdopen FD instead. If FD is -1, use STDOUT as before.
(specify_concurrency): Process the --concurrency=N option value.
(check): Use new xfopen calling convention. Pass -1 for FD.
(mergefps): Use new xfopen calling convention. Pass the FILE's FD for input and SORTER_OUTPUT_FD for output.
(sort): If MAX_CONCURRENCY allows, try to fork off SORTER children. Fork off a final child (the FEEDER) to read inputs and distribute among SORTER children. Merge SORTER output in the parent.
---
 src/sort.c |  288 +++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 263 insertions(+), 25 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 2039dab..18a8882 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -101,6 +101,11 @@ enum
 
 enum
 {
+/* The number of times we should try to fork a child to help with
+   a large sort.  We can always sort everything ourselves if need
+   be so this number can be small. */
+MAX_FORK_TRIES_SORT = 2,
+
 /* The number of times we should try to fork a compression process
    (we retry if the fork call fails). We don't _need_ to compress
    temp files, this is just to reduce disk access, so this number
@@ -223,6 +228,10 @@ static struct month monthtab[] =
   {"SEP", 9}
 };
 
+/* How much data a parallel sort feeder will give to each sorter
+   at a time. */
+#define FEEDER_BUF_SIZE 65536
+
 /* During the merge phase, the number of files to merge at once. */
 #define NMERGE_DEFAULT 16
 
@@ -285,10 +294,19 @@ static struct keyfield *keylist;
 
 /* Program used to (de)compress temp files. Must accept -d. */
 static char const *compress_program;
 
+/* Maximum number of sorters that may be run in parallel.
+   This can be modified on the command line with the --concurrency
+   option. */
+static unsigned
Re: Multi-threading in sort(or core-utils)
Bo Borgerson [EMAIL PROTECTED] writes:
> Does this sound like a step in the right direction for sort? If I were to clean this up and submit it would you be willing to assess its viability as a portable improvement?

Yes, and yes. And thanks!
Re: Multi-threading in sort(or core-utils)
[EMAIL PROTECTED] wrote:
> I think it is a good idea to make an option (or the default) for sorting in threads, to increase performance on systems that might execute more than one thread in parallel.
> Klimentov Konstantin.

I agree. That's been on my to-do list for years. (It shouldn't be that hard, if you ignore portability hassles. :-)

Pádraig Brady [EMAIL PROTECTED] writes:
> 5. I think it would be nice for dd to support reading portions of a file efficiently. As far as I can see it can only do it by reading 1 byte at a time. Perhaps skip_bytes=x and count_bytes=x would be useful additions?

These also sound like good suggestions.
Re: Multi-threading in sort(or core-utils)
Paul Eggert wrote:
> [EMAIL PROTECTED] wrote:
>> I think it is a good idea to make an option (or the default) for sorting in threads, to increase performance on systems that might execute more than one thread in parallel.
>> Klimentov Konstantin.
> I agree. That's been on my to-do list for years. (It shouldn't be that hard, if you ignore portability hassles. :-)

Hi Paul,

I've modified my local sort to parallelize large merges by dividing inputs among a number of children whose outputs are merged by the parent. This only benefits large bulk sorts indirectly, by parallelizing the merge of temp files, but it can still provide a performance improvement.

I suspect my implementation does ignore some portability hassles, but only because I haven't encountered them yet. :)

Does this sound like a step in the right direction for sort? If I were to clean this up and submit it would you be willing to assess its viability as a portable improvement?

Thanks,
Bo
Re: Multi-threading in sort(or core-utils)
[EMAIL PROTECTED] wrote:
> Hello. A few minutes ago I used sort -u for sorting a big file (236 MB). I have a 2-core CPU (Core 2 Duo), but I found that sort uses only one CPU (works in one thread). I think it is a good idea to make an option (or the default) for sorting in threads, to increase performance on systems that might execute more than one thread in parallel.
> Klimentov Konstantin.

Hi,

If you're using a shell that supports process substitution you could try splitting your file in half and putting the bulk sorts of each half as inputs to a merge. So if you were doing:

  $ sort bigfile

You could do:

  $ sort -m <(sort bigfile.firsthalf) <(sort bigfile.secondhalf)

Bo
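The equivalence behind this trick is easy to sanity-check on a small input. A minimal sketch, assuming bash (for process substitution); file names and contents are arbitrary, and the halves are split on whole lines so no line is cut in two:

```shell
#!/bin/bash
# Verify that merging independently sorted halves gives the same result
# as one big sort.
printf 'pear\napple\nplum\ncherry\n' > /tmp/bigfile
head -n 2 /tmp/bigfile > /tmp/bigfile.firsthalf
tail -n 2 /tmp/bigfile > /tmp/bigfile.secondhalf

# Each half is sorted in its own process; the parent only merges.
sort -m <(sort /tmp/bigfile.firsthalf) <(sort /tmp/bigfile.secondhalf) > /tmp/merged
sort /tmp/bigfile > /tmp/direct
cmp -s /tmp/merged /tmp/direct && echo "identical"
```

On a multi-core machine the two inner sorts run concurrently, which is the whole point; the final merge is comparatively cheap because both inputs are already ordered.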
Re: Multi-threading in sort(or core-utils)
Bo Borgerson wrote:
> If you're using a shell that supports process substitution you could try splitting your file in half and putting the bulk sorts of each half as inputs to a merge. So if you were doing:
>   $ sort bigfile
> You could do:
>   $ sort -m <(sort bigfile.firsthalf) <(sort bigfile.secondhalf)

A few notes about that:

1. `LANG=C sort ...` may be appropriate for your data, and should be much quicker than using the collating routines for your locale.

2. How to split a text file into chunks correctly is not obvious to me. Here is a little snippet that might work:

  file=$1; chunks=2
  size=$(find "$file" -printf '%s')
  line_fuzz=80  # to avoid a single line for the last chunk
  chunk_max=$((($size+$line_fuzz)/$chunks))
  split --line-bytes=$chunk_max "$file"

3. The file chunks, if on traditional hard disks, should be on separate spindles. I.e. if both sort processes are reading off the same spindle, they will be fighting over the disk heads and just slow things down a lot.

4. If processing from a Solid State Disk, or doing multiple runs from cache etc., it will probably be quicker to process the portions of the file directly, without having to split them up first. Here is a snippet to process (approximately) each half of a text file directly:

  size=$(find "$1" -printf '%s')
  half=$(($size/2))
  next=$(dd if="$1" bs=1 skip=$half 2>/dev/null | sed q | wc -c)
  chunk2_s=$(($half+$next))
  # note head will read in blocks of 8192
  sort -m <(head -c $chunk2_s "$1" | sort) \
          <(tail -c +$(($chunk2_s+1)) "$1" | sort)

5. I think it would be nice for dd to support reading portions of a file efficiently. As far as I can see it can only do it by reading 1 byte at a time. Perhaps skip_bytes=x and count_bytes=x would be useful additions?

cheers,
Pádraig.
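The arithmetic in note 4 can be exercised on a tiny file to confirm that the computed split point lands just after a newline and that the merged result matches a plain sort. A minimal sketch, assuming bash for process substitution; file name and contents are arbitrary:

```shell
#!/bin/bash
# Exercise the half-split arithmetic from note 4 above.
f=/tmp/halfsplit_demo
printf 'aaaa\nbb\ncccccc\ndd\n' > "$f"    # 18 bytes total

size=$(wc -c < "$f")
half=$((size / 2))                        # 9: lands mid-line, inside "cccccc"
# Count bytes from offset $half through the next newline (inclusive),
# so chunk2_s is the first byte of the following line.
next=$(dd if="$f" bs=1 skip=$half 2>/dev/null | sed q | wc -c)
chunk2_s=$((half + next))                 # 15: first byte of "dd\n"

sort -m <(head -c $chunk2_s "$f" | sort) \
        <(tail -c +$((chunk2_s + 1)) "$f" | sort)
```

For this input the split point works out to byte 15, so the chunks are "aaaa\nbb\ncccccc\n" and "dd\n", and the merged output is identical to `sort "$f"`. On real data the two inner sorts run concurrently, which is where the speedup comes from.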