Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-08 Thread Andrew McGill
Greetings coreutils folks,

There are a number of interesting filesystems (glusterfs, lustre? ... NFS) 
which could benefit from userspace utilities doing certain operations in 
parallel.  (I have a very slow glusterfs installation that makes me think 
that some things can be done better.)

For example, copying a number of files is currently done in series ...
cp a b c d e f g h dest/
but, on certain filesystems, it would be roughly twice as efficient if 
implemented in two parallel threads, something like:
cp a c e g dest/ &
cp b d f h dest/
since the source and destination files can be stored on multiple physical 
volumes.  

Similarly, ls -l . will readdir(), and then stat() each file in the directory.  
On a filesystem with high latency, it would be faster to issue the stat() 
calls asynchronously, and in parallel, and then collect the results for 
display.  (This could improve performance for NFS, in proportion to the 
latency and the number of threads.)
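A userspace approximation of this is possible today with GNU xargs (a sketch, assuming GNU xargs's -P option and GNU coreutils stat; this is not what ls itself does):

```shell
# Issue stat() calls in parallel: 8 concurrent workers, up to 32
# names per stat invocation.  Output order is arbitrary; a real
# "threaded ls" would collect and re-sort results before display.
find . -maxdepth 1 -mindepth 1 -print0 |
  xargs -0 -n 32 -P 8 stat --format='%n %s %Y'
```

Piping the result through sort would restore a stable listing order.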


Question:  Is there already a set of "improved" utilities that implement this 
kind of technique?  If not, would this kind of performance enhancement be 
considered useful?  (It would mean introducing threading into programs which 
are currently single-threaded.)


To the user, it could look very much the same ...
export GNU_COREUTILS_THREADS=8
cp   # manipulate multiple files simultaneously
mv   # manipulate multiple files simultaneously
ls   # stat() multiple files simultaneously

One could also optimise the text utilities like cat by doing the open() and 
stat() operations in parallel and in the background -- userspace read-ahead 
caching.  All of the utilities which process multiple files could get 
small speed boosts from this -- rm, cat, chown, chmod ... even tail, head, 
wc -- but probably only on network filesystems.
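As a sketch of that read-ahead idea (the prefetch_then_sum name is hypothetical), one can warm the page cache in the background and then make the real pass serially:

```shell
# prefetch_then_sum: hypothetical helper illustrating userspace
# read-ahead.  Each file is read into the page cache by a background
# reader; the real (serial) pass then finds the data already warm.
prefetch_then_sum() {
  for f in "$@"; do
    cat -- "$f" > /dev/null &   # background read primes the cache
  done
  wait                          # all prefetch readers have finished
  md5sum -- "$@"                # serial pass over warm caches
}
```

Whether this helps depends on the filesystem; on a local disk the kernel's own read-ahead usually wins.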

&:-)


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-08 Thread Jim Meyering
Andrew McGill <[EMAIL PROTECTED]> wrote:
> Greetings coreutils folks,
>
> There are a number of interesting filesystems (glusterfs, lustre? ... NFS)
> which could benefit from userspace utilities doing certain operations in
> parallel.  (I have a very slow glusterfs installation that makes me think
> that some things can be done better.)
>
> For example, copying a number of files is currently done in series ...
>   cp a b c d e f g h dest/
> but, on certain filesystems, it would be roughly twice as efficient if
> implemented in two parallel threads, something like:
>   cp a c e g dest/ &
>   cp b d f h dest/
> since the source and destination files can be stored on multiple physical
> volumes.

How about parallelizing it via xargs, e.g.,

$ echo a b c d e f g h | xargs -t -n4 --no-run-if-empty \
  --max-procs=2 -- cp --target-directory=dest
cp --target-directory=dest a b c d
cp --target-directory=dest e f g h

Obviously the above is tailored (-n4) to your 8-input example.
In practice, you'd use a larger number, unless latency is
so high as to dwarf the extra fork/exec overhead,
in which case even -n1 might make sense.

mv and ln also accept the --target-directory=dest option.

> Similarly, ls -l . will readdir(), and then stat() each file in the directory.
> On a filesystem with high latency, it would be faster to issue the stat()
> calls asynchronously, and in parallel, and then collect the results for

If you can demonstrate a large performance gain on
systems that many people use, then maybe...

There is more than a little value in keeping programs
like those in the coreutils package relatively simple,
but if the cost(maintenance+portability burden)/benefit
ratio is low enough, then anything is possible.

For example, a well-encapsulated, optionally-threaded
"stat_all_dir_entries" API might be useful in some situations.

If getting any eventual patch into upstream coreutils is
important to you, be sure there is some consensus on this
list before doing a lot of work on it.

> display.  (This could improve performance for NFS, in proportion to the
> latency and the number of threads.)
>
>
> Question:  Is there already a set of "improved" utilities that implement this
> kind of technique?

Not that I know of.

> If not, would this kind of performance enhancement be
> considered useful?

It's impossible to say without knowing more.




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-08 Thread James Youngman
On Sat, Nov 8, 2008 at 6:05 PM, Jim Meyering <[EMAIL PROTECTED]> wrote:
> How about parallelizing it via xargs, e.g.,
>
>$ echo a b c d e f g h | xargs -t -n4 --no-run-if-empty \
>  --max-procs=2 -- cp --target-directory=dest
>cp --target-directory=dest a b c d
>cp --target-directory=dest e f g h

For tools lacking a --target-directory option there is this shell trick:

$ echo a b c d e f g h |
 xargs -n4 --no-run-if-empty --max-procs=2 -- sh -c 'prog "$@" destination' prog

James.




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-09 Thread James Youngman
On Sun, Nov 9, 2008 at 11:06 PM, Dr. David Alan Gilbert
<[EMAIL PROTECTED]> wrote:
> I keep wondering if the OS level needs a better interface; an 'openv' or
> 'statv'.  I'm currently wondering if a combined call would work - something
> which would stat a path, and if it's a normal file, open it, read up to a
> buffer's worth, and if finished, close it - it might work nicely for small
> files.

I suspect that a combined call would not be widely useful, though it
would likely provide a useful speedup for your use case.

I suspect that the statv/openv combination would fit more use-cases.
A statv function could be useful for anything that uses fts, for
example (rm, find, ...), and for file-open dialogue boxes.

I have to say though that I've never used writev.   However, people
continue to design more advanced filesystems; the filesystem knows a
lot about how the data is arranged and (therefore) the optimal order
in which to perform operations.   The application knows a lot about
the set of operations it plans to execute, too.   However, these two
pieces of software communicate through a small keyhole: the POSIX file
API.  I'm not clear though on what nature of API might be more
generally useful for a wide class of programs; existing programs after
all are designed in ways that work well with the existing operating
system interfaces.  Perhaps this overcomplicates the issue though,
since not many programs interact with more than a few dozen files and
therefore probably wouldn't need to adopt a more complex API.

Thanks,
James.




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-09 Thread Dr. David Alan Gilbert
* Andrew McGill ([EMAIL PROTECTED]) wrote:
> Greetings coreutils folks,
> 
> There are a number of interesting filesystems (glusterfs, lustre? ... NFS) 
> which could benefit from userspace utilities doing certain operations in 
> parallel.  (I have a very slow glusterfs installation that makes me think 
> that some things can be done better.)
> 
> For example, copying a number of files is currently done in series ...
>   cp a b c d e f g h dest/
> but, on certain filesystems, it would be roughly twice as efficient if 
> implemented in two parallel threads, something like:
>   cp a c e g dest/ &
>   cp b d f h dest/
> since the source and destination files can be stored on multiple physical 
> volumes.  

Of course you can't do that by hand since each might be a directory with an
unbalanced number of files etc - so you are right, something smarter
is needed (my pet hate is 'tar' or 'cp' working its way through a 
source tree of thousands of small files).

> Similarly, ls -l . will readdir(), and then stat() each file in the directory. 
>  
> On a filesystem with high latency, it would be faster to issue the stat() 
> calls asynchronously, and in parallel, and then collect the results for 
> display.  (This could improve performance for NFS, in proportion to the 
> latency and the number of threads.)

I think, as you are suggesting, you have to end up doing threading
in the userland code, which to me seems mad, since the code doesn't
really know how wide to go and it's a fair overhead.  In addition, this
behaviour can be really bad if you get it wrong - for example
if 'dest' is a single disc then having multiple writers writing two
large files leads to fragmentation on many filesystems.

I once tried to write a backup system that streamed data from tens of machines,
each machine being a separate process, trying to write a few MB at a time on
Linux; unfortunately the kernel was too smart and ended up writing a few KB
from each process before moving on to the next, leading to *awful* throughput.

> Question:  Is there already a set of "improved" utilities that implement this 
> kind of technique?  If not, would this kind of performance enhancement be 
> considered useful?  (It would mean introducing threading into programs which 
> are currently single-threaded.)
> 
> 
> One could also optimise the text utilities like cat by doing the open() and 
> stat() operations in parallel and in the background -- userspace read-ahead 
> caching.  All of the utilities which process multiple files could get 
> small speed boosts from this -- rm, cat, chown, chmod ... even tail, head, 
> wc -- but probably only on network filesystems.

I keep wondering if the OS level needs a better interface; an 'openv' or
'statv'.  I'm currently wondering if a combined call would work - something
which would stat a path, and if it's a normal file, open it, read up to a
buffer's worth, and if finished, close it - it might work nicely for small
files.


Dave
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert| Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _|_ http://www.treblig.org   |___/




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-11 Thread Andrew McGill
On Saturday 08 November 2008 20:05:25 Jim Meyering wrote:
> Andrew McGill <[EMAIL PROTECTED]> wrote:
> > Greetings coreutils folks,
> >
> > There are a number of interesting filesystems (glusterfs, lustre? ...
> > NFS) which could benefit from userspace utilities doing certain
> > operations in parallel.  (I have a very slow glusterfs installation that
> > makes me think that some things can be done better.)
> >
> > For example, copying a number of files is currently done in series ...
> > cp a b c d e f g h dest/
> > but, on certain filesystems, it would be roughly twice as efficient if
> > implemented in two parallel threads, something like:
> > cp a c e g dest/ &
> > cp b d f h dest/
> > since the source and destination files can be stored on multiple physical
> > volumes.
>
> How about parallelizing it via xargs, e.g.,
>
> $ echo a b c d e f g h | xargs -t -n4 --no-run-if-empty \
>   --max-procs=2 -- cp --target-directory=dest
> cp --target-directory=dest a b c d
> cp --target-directory=dest e f g h
>
> Obviously the above is tailored (-n4) to your 8-input example.
> In practice, you'd use a larger number, unless latency is
> so high as to dwarf the extra fork/exec overhead,
> in which case even -n1 might make sense.
I ran the command above with md5sum as the command, and got missing lines in
the output.  I had optimistically hoped that would not happen!

> mv and ln also accept the --target-directory=dest option.
>
> > Similarly, ls -l . will readdir(), and then stat() each file in the
> > directory. On a filesystem with high latency, it would be faster to issue
> > the stat() calls asynchronously, and in parallel, and then collect the
> > results for
>
> If you can demonstrate a large performance gain on
> systems that many people use, then maybe...
>
> There is more than a little value in keeping programs
> like those in the coreutils package relatively simple,
> but if the cost(maintenance+portability burden)/benefit
> ratio is low enough, then anything is possible.
>
> For example, a well-encapsulated, optionally-threaded
> "stat_all_dir_entries" API might be useful in some situations.
So a relatively small change for parallel stat() in "ls" could fly.

> If getting any eventual patch into upstream coreutils is
> important to you, be sure there is some consensus on this
> list before doing a lot of work on it.
Any ideas on how to do a parallel cp / mv in a way that is not Considered 
Harmful?  Maybe prefetch_files(max_bytes,file1,...,NULL) ... aargh.
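One userspace shape for this (a sketch, assuming GNU xargs's -P and GNU cp's --target-directory; the parallel_cp name is hypothetical) splits the file list across disjoint workers:

```shell
# parallel_cp: hypothetical wrapper splitting one copy across workers.
# Each worker receives a disjoint subset of the files, so no two
# workers ever touch the same destination file; whether it actually
# helps depends on the filesystem's latency and parallelism.
parallel_cp() {
  nworkers=$1; dest=$2; shift 2
  printf '%s\0' "$@" |
    xargs -0 -P "$nworkers" -n 4 cp --target-directory="$dest"
}
```

The fragmentation concern raised elsewhere in this thread still applies when dest is a single spindle.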

> > display.  (This could improve performance for NFS, in proportion to the
> > latency and the number of threads.)
> >
> >
> > Question:  Is there already a set of "improved" utilities that implement
> > this kind of technique?
>
> Not that I know of.
>
> > If not, would this kind of performance enhancement be
> > considered useful?
>
> It's impossible to say without knowing more.

On the (de?)merits of xargs for parallel processing:

What would you expect this to do --:

    find -type f -print0 |
        xargs -0 -n 8 --max-procs=16 md5sum >& ~/md5sums

    sort -k2 < md5sums > md5sums.sorted

Compared to this?

    find -type f -print0 |
        xargs -0 md5sum >& ~/md5sums

    sort -k2 < md5sums > md5sums.sorted

I was a little surprised that on my system, running in parallel (the first 
version) loses around 1 line of output per thousand (md5sum of 22 GB in mostly 
small files).  

Is there a correct way to do md5sums in parallel without having a shared 
output buffer which eats output (I presume) -- or is losing output when 
haphazardly combining output streams actually strange and unusual?




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-12 Thread Phillip Susi

James Youngman wrote:
> This version should be race-free:
>
> find -type f -print0 |
>  xargs -0 -n 8 --max-procs=16 md5sum >> ~/md5sums 2>&1
>
> I think that writing into a pipe should be OK, since pipes are
> non-seekable.  However, with pipes in this situation you still have a
> problem if processes try to write more than PIPE_BUF bytes.

You aren't using a pipe there.  What you are doing is having the shell
open the file; the md5sum processes all inherit that fd, so they all
share the same offset.  As long as they write() the entire line at once,
the file pointer will be updated atomically for all processes and the
lines from each process won't clobber each other.
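That atomic-append behaviour can be demonstrated from the shell (a sketch; each echo is one short write(2) to a descriptor opened with O_APPEND, so no writer should clobber another):

```shell
# Four background writers append short lines to one shared file
# opened with >> (O_APPEND).  The kernel advances the file offset
# atomically on each append, so every line survives.
out=$(mktemp)
for w in 1 2 3 4; do
  ( for i in $(seq 100); do echo "worker $w line $i"; done >> "$out" ) &
done
wait
wc -l < "$out"    # 400 if no line was lost
rm -f "$out"
```

Replacing >> with > in the inner redirection reintroduces exactly the lost-line symptom reported earlier in the thread.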






Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-12 Thread James Youngman
[ CC ++ [EMAIL PROTECTED] ]


On Tue, Nov 11, 2008 at 2:58 PM, Andrew McGill <[EMAIL PROTECTED]> wrote:
> What would you expect this to do --:
>
> find -type f -print0 |
> xargs -0 -n 8 --max-procs=16 md5sum >& ~/md5sums

Produce a race condition :)  It generates 16 parallel processes,
each writing to the md5sums file.  Unfortunately the writes sometimes
occur at the same offset in the output file.  To illustrate:

~$ strace -f -e open,fork,execve sh -c "echo hello > foo"
execve("/bin/sh", ["sh", "-c", "echo hello > foo"], [/* 39 vars */]) = 0
[...]
open("foo", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
~$ strace -f -e open,fork,execve sh -c "echo hello >> foo"
execve("/bin/sh", ["sh", "-c", "echo hello >> foo"], [/* 39 vars */]) = 0
[...]
open("foo", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3

This version should be race-free:

find -type f -print0 |
 xargs -0 -n 8 --max-procs=16 md5sum >> ~/md5sums 2>&1

I think that writing into a pipe should be OK, since pipes are
non-seekable.  However, with pipes in this situation you still have a
problem if processes try to write more than PIPE_BUF bytes.


> Is there a correct way to do md5sums in parallel without having a shared
> output buffer which eats output (I presume) -- or is losing output when
> haphazardly combining output streams actually strange and unusual?

I hope the solution above solves your problem - please follow up
if so.  This example is probably worthy of a mention in the
xargs documentation, too.

Thanks for your comment!

James.




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-13 Thread Ralf Wildenhues
Hello Andrew,

Andrew McGill  lunch.za.net> writes:
>     find -type f -print0 |
>         xargs -0 -n 8 --max-procs=16 md5sum >& ~/md5sums
>
>     sort -k2 < md5sums > md5sums.sorted

To avoid losing output, use append mode for writing:

  : > ~/md5sums
  find -type f -print0 |
    xargs -0 -n 8 --max-procs=16 md5sum >> ~/md5sums 2>&1

  sort -k2 < md5sums > md5sums.sorted

This just recently came up in Autoconf:


Cheers,
Ralf





Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-13 Thread James Youngman
On Wed, Nov 12, 2008 at 4:01 PM, Phillip Susi <[EMAIL PROTECTED]> wrote:
> James Youngman wrote:
>>
>> This version should be race-free:
>>
>> find -type f -print0 |
>> xargs -0 -n 8 --max-procs=16 md5sum >> ~/md5sums 2>&1
>>
>> I think that writing into a pipe should be OK, since pipes are
>> non-seekable.  However, with pipes in this situation you still have a
>> problem if processes try to write more than PIPE_BUF bytes.
>
> You aren't using a pipe there.

I know this.  My point was that >> ensures the file is opened with
O_APPEND, and that the equivalent xargs command writing into a pipe
instead of a file redirection is also safe.

> What you are doing is having the shell open
> the file, then the md5sum processes all inherit that fd so they all share
> the same offset.  As long as they write() the entire line at once, the file
> pointer will be updated atomically for all processes and the lines from each
> process won't clobber each other.
>
>




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-13 Thread Andrew McGill
On Thursday 13 November 2008 14:52:44 Ralf Wildenhues wrote:
> Hello Andrew,
>
> Andrew McGill  lunch.za.net> writes:
> >     find -type f -print0 |
> > xargs -0 -n 8 --max-procs=16 md5sum >& ~/md5sums
> >
> >     sort -k2 < md5sums > md5sums.sorted
>
> To avoid losing output, use append mode for writing:
>  : > ~/md5sums
>
>  find -type f -print0 |
>  xargs -0 -n 8 --max-procs=16 md5sum >> ~/md5sums 2>&1
>
>  sort -k2 < md5sums > md5sums.sorted
>
> This just recently came up in Autoconf:
> 
Ah!  I see!  So without O_APPEND, things don't work quite right.

At the risk of drifting off topic - is there ever a benefit in the shell 
implementing a ">"-redirection with just O_TRUNC, rather than O_TRUNC | 
O_APPEND?  Does the output process ever need to seek() back in stdout?  (If 
this is off topic, please feel free to flame me, and/or direct me to the 
correct forum -- but I did freely send a bug report to the bash folks, even 
though I'll bet they're not alone in omitting O_APPEND with O_TRUNC.)

&:-)




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-15 Thread James Youngman
On Fri, Nov 14, 2008 at 5:44 AM, Andrew McGill <[EMAIL PROTECTED]> wrote:
> At the risk of drifting off topic - is there ever a benefit in the shell
> implementing a ">"-redirection with just O_TRUNC, rather than O_TRUNC |
> O_APPEND ?

That is already the existing behaviour.

The >> redirection operator is needed to get O_APPEND.

> Does the output process ever need to seek() back in stdout?  (If
> this is off topic, please feel free to flame me, and/or direct me to the correct
> forum -- but I did freely send a bug report to the bash folks, even though
> I'll bet they're not alone in omitting O_APPEND with O_TRUNC).

It's not a bug; POSIX requires >> to use O_APPEND and > not to.
See 
http://www.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_07_02

James.

