Re: Multi-threading in sort (or core-utils)

2008-11-27 Thread Pádraig Brady
On 16th June 2008, Paul Eggert wrote:
 
 Pádraig Brady [EMAIL PROTECTED] writes:
 
 5. I think it would be nice for dd to support reading portions of
 a file efficiently. As far as I can see it can only do it by reading
 1 byte at a time. Perhaps skip_bytes=x and count_bytes=x would
 be useful additions?
 
 These also sound like good suggestions.

I just noticed in the info docs for dd that there
is a method to achieve skip_bytes at least:

(dd bs=1 skip=123456 count=0 && dd bs=8192) < file
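
A rough count_bytes equivalent can be had in the same spirit by capping
the output with head (this part isn't from the docs, and the byte count
below is just an arbitrary example):

(dd bs=1 skip=123456 count=0 && dd bs=8192) < file | head -c 1000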

Pádraig.




Re: Multi-threading in sort (or core-utils)

2008-06-28 Thread Bo Borgerson
James Youngman wrote:
 On Thu, Jun 26, 2008 at 1:22 AM, Bo Borgerson [EMAIL PROTECTED] wrote:
 If all inputs are regular files then SORTERS read directly rather than
 being fed by an extra process.
 
 Does that work with multi-byte character sets?


Hi James,

Each sorter's portion of input is delineated along line boundaries as
detected by the main buffer-filling routine.  I don't think any
multi-byte character set problems should have been introduced.

What type of issue specifically concerns you?  I'm going to start
putting together some tests soon, and I'd like to include a test case
that would exercise the type of bug you have in mind.
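
For instance, something along these lines (only a sketch; the locale
name, file names and concurrency level are placeholders, and
--concurrency is the option added by the patch):

printf 'béta\nalpha\néclair\ngamma\n' > mb.txt
for i in $(seq 1 10000); do cat mb.txt; done > mb_big.txt
LC_ALL=en_US.UTF-8 ~/sort mb_big.txt > expected
LC_ALL=en_US.UTF-8 ~/sort --concurrency=4 mb_big.txt > got
cmp expected got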

Thanks,

Bo




Re: Multi-threading in sort (or core-utils)

2008-06-28 Thread James Youngman
On Sat, Jun 28, 2008 at 4:27 PM, Bo Borgerson [EMAIL PROTECTED] wrote:
 Hi James,

 Each sorter's portion of input is delineated along line boundaries as
 detected by the main buffer-filling routine.  I don't think any
 multi-byte character set problems should have been introduced.

I see.   I agree, then.

 What type of issue specifically concerns you?  I'm going to start
 putting together some tests soon, and I'd like to include a test case
 that would exercise the type of bug you have in mind.

Given that you had done away with the feeder, I had assumed that you'd
just set things up so that for N sort workers on an M byte input,
worker i would start at the next line boundary after byte offset
(i*M/N).  I was essentially wondering how you could accurately
determine the actual line boundaries in a multibyte file.  But I now
understand that you are not doing things that way.
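
Just to spell out what I had been picturing, as a rough shell sketch
(not a suggestion for the patch; N and the file name are placeholders):

file=$1; N=4
M=$(find $file -printf %s)
for i in $(seq 0 $(($N-1))); do
  off=$(($i*$M/$N))
  if [ $i -eq 0 ]; then
    start=0
  else
    # skip to off, then advance past the remainder of that line
    rest=$(tail -c +$(($off+1)) $file | sed q | wc -c)
    start=$(($off+$rest))
  fi
  echo "worker $i would start at byte $start"
done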

James.




Re: Multi-threading in sort (or core-utils)

2008-06-25 Thread Bo Borgerson
Bo Borgerson wrote:
 Cons:
 - It's limited by the feeder.  If the sorters were able to read at their
 own pace I think this would scale better.
 - It uses N+2 processes.  When sorters are run in parallel there are two
 helper processes, one feeding input and one merging output.

Hello again.

In light of the drawbacks mentioned for my previous parallel sort patch,
I've made the attached modifications.

If all inputs are regular files then SORTERS read directly rather than
being fed by an extra process.

This exhibits better performance with a concurrency of 2, but still does
not realize the full benefit of greater concurrency that I was expecting:

-

$ for i in 0 1 2 3; do cat /dev/urandom | base64 | head -200 | cut -da -f1 > p$i; done

$ time ~/sort p0 p1 p2 p3 > /dev/null

real    0m31.444s

$ time ~/sort --concurrency=2 p0 p1 p2 p3 > /dev/null

real    0m16.908s

$ time ~/sort --concurrency=4 p0 p1 p2 p3 > /dev/null

real    0m15.353s

$ time ~/sort -m <(~/sort p0 p1) <(~/sort p2 p3) > /dev/null

real    0m17.066s -- similar to --concurrency=2

$ time ~/sort -m <(~/sort p0) <(~/sort p1) <(~/sort p2) <(~/sort p3) > /dev/null

real    0m10.832s -- _this_ is what I want from --concurrency=4!

-

Jim pointed out a mistake in my performance testing script that was
causing me to use a smaller sample size for each data point than I
intended.  I've attached a version with his patch.

Thanks,

Bo
inline: bench_output_server_2.png

From 298d6871aee1a1a506015681fb88ce5ba2b24644 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Tue, 24 Jun 2008 14:02:22 -0400
Subject: [PATCH] o sort: Don't use a feeder for regular file concurrency.

* src/sort.c (xlseek): Try to lseek, complain on failure.  Stolen from src/tail.c.
(fillbuf): If SORTER_BYTES_LEFT is non-negative, treat it as a limit.
(sort): When running multiple sorters concurrently, if all inputs are regular
files set sorters up to read directly rather than spawning a feeder to distribute
work among them.
---
 src/sort.c |  220 ---
 1 files changed, 194 insertions(+), 26 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 18a8882..4aa7609 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -307,6 +307,11 @@ static unsigned int nmerge = NMERGE_DEFAULT;
their output through a pipe to the parent who will merge. */
 static int sorter_output_fd = -1;
 
+/* If multiple sorters are each reading their own input rather
+   than being fed by a single process then they'll have a cap
+   on how much they can read. */
+static size_t sorter_bytes_left = -1;
+
 static void sortlines_temp (struct line *, size_t, struct line *);
 
 /* Report MESSAGE for FILE, then clean up and exit.
@@ -831,6 +836,45 @@ xfclose (FILE *fp, char const *file)
 }
 }
 
+/* Call lseek with the specified arguments, where file descriptor FD
+   corresponds to the file, FILENAME.
+   Give a diagnostic and exit nonzero if lseek fails.
+   Otherwise, return the resulting offset.
+
+   This is stolen from src/tail.c */
+
+static off_t
+xlseek (int fd, off_t offset, int whence, char const *filename)
+{
+  off_t new_offset = lseek (fd, offset, whence);
+  char buf[INT_BUFSIZE_BOUND (off_t)];
+  char *s;
+
+  if (0 <= new_offset)
+return new_offset;
+
+  s = offtostr (offset, buf);
+  switch (whence)
+{
+case SEEK_SET:
+  error (0, errno, _("%s: cannot seek to offset %s"),
+	 filename, s);
+  break;
+case SEEK_CUR:
+  error (0, errno, _("%s: cannot seek to relative offset %s"),
+	 filename, s);
+  break;
+case SEEK_END:
+  error (0, errno, _("%s: cannot seek to end-relative offset %s"),
+	 filename, s);
+  break;
+default:
+  abort ();
+}
+
+  exit (EXIT_FAILURE);
+}
+
 static void
 dup2_or_die (int oldfd, int newfd)
 {
@@ -1556,7 +1600,8 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
   size_t line_bytes = buf->line_bytes;
   size_t mergesize = merge_buffer_size - MIN_MERGE_BUFFER_SIZE;
 
-  if (buf->eof)
+
+  if (buf->eof || 0 == sorter_bytes_left)
 return false;
 
   if (buf->used != buf->left)
@@ -1582,9 +1627,23 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
 	 rest of the input file consists entirely of newlines,
 	 except that the last byte is not a newline.  */
 	  size_t readsize = (avail - 1) / (line_bytes + 1);
-	  size_t bytes_read = fread (ptr, 1, readsize, fp);
-	  char *ptrlim = ptr + bytes_read;
+	  size_t bytes_read;
+	  char *ptrlim;
 	  char *p;
+
+	  if (0 < sorter_bytes_left)
+	readsize = MIN (readsize, sorter_bytes_left);
+
+	  bytes_read = fread (ptr, 1, readsize, fp);
+
+	  if (0 < sorter_bytes_left)
+	sorter_bytes_left -= bytes_read;
+
+	  /* This is a fake end-of-file for this sorter child. */
+	  if (0 == sorter_bytes_left)
+	buf->eof = true;
+
+	  ptrlim = ptr + bytes_read;
 	  avail -= bytes_read;
 
 	  if (bytes_read != readsize)
@@ -2627,6 +2686,9 @@ sort (char * const *incoming_files, size_t nfiles, 

Re: Multi-threading in sort (or core-utils)

2008-06-22 Thread Bo Borgerson
Paul Eggert wrote:
 Bo Borgerson [EMAIL PROTECTED] writes:
 
 Does this sound like a step in the right direction for sort?  If I were
 to clean this up and submit it would you be willing to assess its
 viability as a portable improvement?
 
 Yes, and yes.  And thanks!

Hi Paul,

When I isolated my parallel merge enhancement I discovered that the
improvement I was seeing was mostly the result of my not having properly
divided resources (particularly NMERGE) among children.  Aside from some
benefit to unique merges with many duplicates I wasn't able to produce
satisfactory results using this approach.

So I started from scratch on a parallel bulk sort enhancement.  Here I
was able to see some modest but reliable improvement.

The approach I took was to divide the main work among a number of
children (sorters) by using an additional child (the feeder) to read
inputs and distribute data among them.  The parent then merges sorter
output.
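
At the shell level the data flow is roughly analogous to the following
(only an illustration of the process topology, not how the patch is
implemented; file names are placeholders):

# "feeder": deal input lines out to 4 chunks, round-robin
awk '{ print > ("chunk" NR%4) }' input
# 4 "sorters" run independently; the "parent" merges their output
sort -m <(sort chunk0) <(sort chunk1) <(sort chunk2) <(sort chunk3)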

This approach has some pros and cons in my view:

Pros:
- It's simple.  Sorters don't need to worry about what their siblings
are doing.  They just process the data they're fed.
- It doesn't require a known amount of data.  Work is distributed among
sorters by the feeder in small interleaved chunks.
- It doesn't require regular files.  Data coming through FIFOs or from
sub-pipelines via process substitution is no problem.
- It limits increased resource consumption.  The feeder is the only
process reading from disk until/unless the sorters need temporary files.
 NMERGE and SORT_BUFFER_SIZE are divided among sorters.

Cons:
- It's limited by the feeder.  If the sorters were able to read at their
own pace I think this would scale better.
- It uses N+2 processes.  When sorters are run in parallel there are two
helper processes, one feeding input and one merging output.

I've attached the results of some performance testing I did both on my
laptop (dual core) and on a server (4x dual core, hyper-threaded == 16
processors visible).  I included two graphs of the server results
which I thought were interesting.  Each line represents a level of
concurrency.  One graph shows time in seconds on the Y axis while the
other shows percentage of single-process time.  In both cases lower
lines indicate better performance.  As you can see, even on the 16
processor machine performance peaks at a low concurrency.

I included the script I used for testing in case anyone else is
interested and has a machine they're willing to run hot for a few hours.

The attached patch is also available for fetch from
git://repo.or.cz/coreutils/bo.git as branch 'sort'.
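
If anyone wants to try it out, something like the following should pull
that branch into an existing coreutils checkout (the local branch name
here is arbitrary):

git fetch git://repo.or.cz/coreutils/bo.git sort:bo-sort
git checkout bo-sort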

I haven't included any tests or documentation in the patch yet.  I was
hoping to first get a sense of whether you and other more experienced
coreutils developers consider this alternate approach to be worth pursuing.

Thanks,

Bo
inline: bench_output_server_seconds.png
inline: bench_output_server_percentage.png

From e31f3f11a2d06079182ae7892e3af280dc4044cc Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 18 Jun 2008 09:59:46 -0400
Subject: [PATCH] sort: Add new option --concurrency=N.

* src/sort.c (xfopen): Take an additional argument, FD. If FILE is NULL
then fdopen FD instead.  If FD is -1, use STDOUT as before.
(specify_concurrency): Process the --concurrency=N option value.
(check): Use new xfopen calling convention.  Pass -1 for FD.
(mergefps): Use new xfopen calling convention.  Pass the FILE's FD for input
and SORTER_OUTPUT_FD for output.
(sort): If MAX_CONCURRENCY allows, try to fork off SORTER children.
Fork off a final child (the FEEDER) to read inputs and distribute among
SORTER children.  Merge SORTER output in the parent.
---
 src/sort.c |  288 ++-
 1 files changed, 263 insertions(+), 25 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 2039dab..18a8882 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -101,6 +101,11 @@ enum
 
 enum
   {
+/* The number of times we should try to fork a child to help with
+   a large sort.  We can always sort everything ourselves if need
+   be so this number can be small. */
+MAX_FORK_TRIES_SORT = 2,
+
 /* The number of times we should try to fork a compression process
(we retry if the fork call fails).  We don't _need_ to compress
temp files, this is just to reduce disk access, so this number
@@ -223,6 +228,10 @@ static struct month monthtab[] =
   {"SEP", 9}
 };
 
+/* How much data a parallel sort feeder will give to each sorter
+   at a time. */
+#define FEEDER_BUF_SIZE 65536
+
 /* During the merge phase, the number of files to merge at once. */
 #define NMERGE_DEFAULT 16
 
@@ -285,10 +294,19 @@ static struct keyfield *keylist;
 /* Program used to (de)compress temp files.  Must accept -d.  */
 static char const *compress_program;
 
+/* Maximum number of sorters that may be run in parallel.
+   This can be modified on the command line with the --concurrency
+   option. */
+static unsigned 

Re: Multi-threading in sort (or core-utils)

2008-06-20 Thread Paul Eggert
Bo Borgerson [EMAIL PROTECTED] writes:

 Does this sound like a step in the right direction for sort?  If I were
 to clean this up and submit it would you be willing to assess its
 viability as a portable improvement?

Yes, and yes.  And thanks!




Re: Multi-threading in sort (or core-utils)

2008-06-16 Thread Paul Eggert
 [EMAIL PROTECTED] wrote:
 I think it would be a good idea to add an option (or do this by default)
 for sorting in threads, to increase performance on systems that can
 execute more than one thread in parallel.
Klimentov Konstantin.

I agree.  That's been on my to-do list for years.  (It shouldn't be
that hard, if you ignore portability hassles.  :-)

Pádraig Brady [EMAIL PROTECTED] writes:

 5. I think it would be nice for dd to support reading portions of
 a file efficiently. As far as I can see it can only do it by reading
 1 byte at a time. Perhaps skip_bytes=x and count_bytes=x would
 be useful additions?

These also sound like good suggestions.




Re: Multi-threading in sort (or core-utils)

2008-06-16 Thread Bo Borgerson
Paul Eggert wrote:
 [EMAIL PROTECTED] wrote:
 I think it would be a good idea to add an option (or do this by default)
 for sorting in threads, to increase performance on systems that can
 execute more than one thread in parallel.
Klimentov Konstantin.
 
 I agree.  That's been on my to-do list for years.  (It shouldn't be
 that hard, if you ignore portability hassles.  :-)
 

Hi Paul,

I've modified my local sort to parallelize large merges by dividing
inputs among a number of children whose outputs are merged by the parent.

This only benefits large bulk sorts indirectly by parallelizing the
merge of temp files, but it can still provide a performance improvement.

I suspect my implementation does ignore some portability hassles, but
only because I haven't encountered them yet. :)

Does this sound like a step in the right direction for sort?  If I were
to clean this up and submit it would you be willing to assess its
viability as a portable improvement?

Thanks,

Bo




Re: Multi-threading in sort (or core-utils)

2008-06-13 Thread Bo Borgerson
[EMAIL PROTECTED] wrote:
 Hello
 A few minutes ago I used sort -u to sort a big file (236 MB). I have a
 2-core CPU (Core 2 Duo), but I found that sort uses only one CPU (works
 in one thread). I think it would be a good idea to add an option (or do
 this by default) for sorting in threads, to increase performance on
 systems that can execute more than one thread in parallel.
Klimentov Konstantin.

Hi,

If you're using a shell that supports process substitution you could try
splitting your file in half and putting the bulk sorts of each half as
inputs to a merge:

So if you were doing:

$ sort bigfile

You could do:

$ sort -m <(sort bigfile.firsthalf) <(sort bigfile.secondhalf)

Bo




Re: Multi-threading in sort (or core-utils)

2008-06-13 Thread Pádraig Brady
Bo Borgerson wrote:
 [EMAIL PROTECTED] wrote:
 Hello
 A few minutes ago I used sort -u to sort a big file (236 MB). I have a
 2-core CPU (Core 2 Duo), but I found that sort uses only one CPU (works
 in one thread). I think it would be a good idea to add an option (or do
 this by default) for sorting in threads, to increase performance on
 systems that can execute more than one thread in parallel.
Klimentov Konstantin.
 
 Hi,
 
 If you're using a shell that supports process substitution you could try
 splitting your file in half and putting the bulk sorts of each half as
 inputs to a merge:
 
 So if you were doing:
 
 $ sort bigfile
 
 You could do:
 
 $ sort -m <(sort bigfile.firsthalf) <(sort bigfile.secondhalf)

A few notes about that

1. `LANG=C sort ...` may be appropriate for your data, and should
be much quicker than using the collating routines for your locale

2. How to split a text file into chunks correctly is not
obvious to me. Here is a little snippet that might work:

file=$1; chunks=2
size=$(find $file -printf %s)
line_fuzz=80 #to avoid single line for last chunk
chunk_max=$((($size+$line_fuzz)/$chunks))
split --line-bytes=$chunk_max $file
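
The resulting pieces can then be fed straight to the merge as suggested
above (the names assume split's default output of xaa and xab when
chunks=2):

sort -m <(sort xaa) <(sort xab)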

3. The file chunks, if on traditional hard disks, should be on
separate spindles. I.e. if both sort processes are reading off the
same spindle, they will be fighting over the disk heads, which will
just slow things down a lot.

4. If processing from a Solid State Disk, or doing multiple runs
from cache etc, it will probably be quicker to process the portions
of the file directly, without having to split them up first.
Here is a snippet to process (approximately) each half of
a text file directly:

size=$(find $1 -printf %s)
half=$(($size/2))
next=$(dd if=$1 bs=1 skip=$half 2>/dev/null | sed q | wc -c)
chunk2_s=$(($half+$next))
#note head will read in blocks of 8192
sort -m <(head -c $chunk2_s $1 | sort) \
<(tail -c +$(($chunk2_s+1)) $1 | sort)
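
A quick sanity check that the two portions cover the file exactly (not
part of the snippet above, just a verification idea):

cat <(head -c $chunk2_s $1) <(tail -c +$(($chunk2_s+1)) $1) | cmp - $1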

5. I think it would be nice for dd to support reading portions of
a file efficiently. As far as I can see it can only do it by reading
1 byte at a time. Perhaps skip_bytes=x and count_bytes=x would
be useful additions?

cheers,
Pádraig.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils