Feature request: handling space-delimited data with cut

2009-05-15 Thread Andrew McGill
A novice will cut out a list of PIDs like this:

 ps uaxw | grep nobody | cut -f 2  # huh?
 ps uaxw | grep nobody | cut -f 2 -d ' '   # oh, not tabs, but spaces. huh?
 ps uaxw | grep nobody | cut -f 4 -d ' '   # 'nobody' is followed by 3 spaces
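
Why the flailing: with -d ' ' every single space starts a new field, so a
run of spaces yields empty fields. A quick demonstration on a made-up
input line:

 printf 'nobody   1234\n' | cut -d ' ' -f 2   # empty: field 2 is ""
 printf 'nobody   1234\n' | cut -d ' ' -f 4   # 1234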

After consulting 3 friends and 6 Linux mailing lists about that one random 
line it prints out, they learn that the canonical ways of doing this simple 
task are all just horrid:

 ps uaxw | grep nobody | tr -s ' ' | cut -f 2 -d ' '
 ps uaxw | grep nobody | awk '{print $2}'
 ps uaxw | grep nobody | while read x y z; do echo $y; done


There are two problems:

 * Input is not tab delimited.  It might once have been, but currently the
   most interesting input is space delimited.

 * There is no SIMPLE tool for handling space delimited data (awk is not
   simple, IMHO)


There are a few ways of fixing this:

 1. Adding an option (-w) similar to -f to cut words separated by whitespace 
(same rules as sort, unless the whitespace is changed to something else
with -d)

ps uaxwf | grep nobody | cut -w 2
ps uaxwf | grep nobody | cut --words 2

 2. Adding a switch (-w) that sets the delimiter to runs of whitespace

ps uaxwf | grep nobody | cut -w -f 2
ps uaxwf | grep nobody | cut --whitespace -f 2

 3. Adding an option (-m) to merge delimiters, similar to what tr -s does, but 
without having to specify the delimiter twice on the command line

ps uaxwf | grep nobody | cut -m -d ' ' -f 2
ps uaxwf | grep nobody | cut --merge-delimiters -d ' ' -f 2

 4. Adding out-of-band meta-information about the input stream, thereby 
violating everything unixy, and condemning generations of children not to 
understand that the scissors they are running with are sharp:

ps uaxwf | grep nobody | cut -f 2

It could be called Alternate Data Streams and Extended Attributes

 5. Embark on a campaign of education and training so that the true path to 
awk and tr enlightenment will be known to all.

I'm in favour of cut -w ... can I send a patch?
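
In the meantime a stop-gap is possible in plain shell -- a sketch, with
the name cutw and its interface made up to approximate option 1:

 cutw() {
     # strip leading blanks, squeeze runs of blanks to one space,
     # then cut on single spaces
     sed 's/^[[:blank:]]*//; s/[[:blank:]]\{1,\}/ /g' | cut -d ' ' -f "$1"
 }
 ps uaxw | grep nobody | cutw 2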

:-)




Re: option abbreviation exceptions

2009-01-05 Thread Andrew McGill
On Tuesday 30 December 2008 15:00:18 Eric Blake wrote:
 According to Pádraig Brady on 12/30/2008 2:46 AM:
  Usage: truncate [OPTION]... [FILE]...
 
  Is supporting stdin a useful enhancement?
er ...
  Maybe if you can get the shell to open
  different files based on some condition,
  though again that seems a little contrived.

 if cond ; then
   foo=file1
 else
   foo=file2
 fi
 truncate -s0 <$foo
This redirection is wonderful, but entirely counter-intuitive.  By convention 
stdout is where writes occur, stdin is where reads occur.  Modifying the file 
given as stdin is just a little unexpected.  

For good measure (all?) shells open stdin as read-only, which makes the 
operation fail -- ftruncate(0,0) gives invalid argument.  The redirection 
you need for a writable stdin under bash seems to be this one:
  truncate -s$SIZE 0<>$foo
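
The difference is easy to see without truncate at all (bash syntax; f is
a scratch file): '<' opens fd 0 O_RDONLY, while '0<>' opens it O_RDWR:

 echo data > f
 echo oops < f >&0     # write via the read-only fd fails: Bad file descriptor
 echo oops 0<> f >&0   # read-write fd: overwrites the start of f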

:-)




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-13 Thread Andrew McGill
On Thursday 13 November 2008 14:52:44 Ralf Wildenhues wrote:
 Hello Andrew,

 Andrew McGill list2008 at lunch.za.net writes:
      find -type f -print0 |
          xargs -0 -n 8 --max-procs=16 md5sum > ~/md5sums

      sort -k2 < md5sums > md5sums.sorted

 To avoid losing output, use append mode for writing:
  : > ~/md5sums

  find -type f -print0 |
      xargs -0 -n 8 --max-procs=16 md5sum >> ~/md5sums 2>&1

  sort -k2 < md5sums > md5sums.sorted

 This just recently came up in Autoconf:
 http://thread.gmane.org/gmane.comp.shells.bash.bugs/11958
Ah!  I see!  So without O_APPEND, things don't work quite right.

At the risk of drifting off topic -- is there ever a benefit in the shell 
implementing a '>' redirection with just O_TRUNC, rather than O_TRUNC | 
O_APPEND?  Does the output process ever need to seek() back in stdout?  (If 
this is off topic, please feel free to flame me, and/or direct me to the 
correct forum -- but I did freely send a bug report to the bash folks, even 
though I'll bet they're not alone in omitting O_APPEND with O_TRUNC.)
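
For what it's worth, at least one everyday program does seek() on stdout:
dd with seek=. Under O_APPEND the kernel pins every write to end-of-file
and any earlier seek is ignored, so plain O_TRUNC is not always wrong. A
small demonstration, with a scratch file f:

 printf AAAAAAAA > f
 printf ZZ | dd bs=1 seek=2 conv=notrunc 1<> f 2>/dev/null
 cat f    # AAZZAAAA -- the write landed at the seek offset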

:-)




Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-11 Thread Andrew McGill
On Saturday 08 November 2008 20:05:25 Jim Meyering wrote:
 Andrew McGill [EMAIL PROTECTED] wrote:
  Greetings coreutils folks,
 
  There are a number of interesting filesystems (glusterfs, lustre? ...
  NFS) which could benefit from userspace utilities doing certain
  operations in parallel.  (I have a very slow glusterfs installation that
  makes me think that some things can be done better.)
 
  For example, copying a number of files is currently done in series ...
  cp a b c d e f g h dest/
  but, on certain filesystems, it would be roughly twice as efficient if
  implemented in two parallel threads, something like:
  cp a c e g dest/ &
  cp b d f h dest/ &
  since the source and destination files can be stored on multiple physical
  volumes.

 How about parallelizing it via xargs, e.g.,

 $ echo a b c d e f g h | xargs -t -n4 --no-run-if-empty \
   --max-procs=2 -- cp --target-directory=dest
 cp --target-directory=dest a b c d
 cp --target-directory=dest e f g h

 Obviously the above is tailored (-n4) to your 8-input example.
 In practice, you'd use a larger number, unless latency is
 so high as to dwarf the cost of extra fork/exec syscalls,
 in which case even -n1 might make sense.
I did the command above with md5sum as the command, and got missing lines in 
the output.  I optimistically hoped that would not happen!

 mv and ln also accept the --target-directory=dest option.

  Similarly, ls -l . will readdir(), and then stat() each file in the
  directory. On a filesystem with high latency, it would be faster to issue
  the stat() calls asynchronously, and in parallel, and then collect the
  results for

 If you can demonstrate a large performance gain on
 systems that many people use, then maybe...

 There is more than a little value in keeping programs
 like those in the coreutils package relatively simple,
 but if the cost(maintenance+portability burden)/benefit
 ratio is low enough, then anything is possible.

 For example, a well-encapsulated, optionally-threaded
 stat_all_dir_entries API might be useful in some situations.
So a relatively small change for parallel stat() in ls could fly.

 If getting any eventual patch into upstream coreutils is
 important to you, be sure there is some consensus on this
 list before doing a lot of work on it.
Any ideas on how to do a parallel cp / mv in a way that is not Considered 
Harmful?  Maybe prefetch_files(max_bytes,file1,...,NULL) ... aargh.

  display.  (This could improve performance for NFS, in proportion to the
  latency and the number of threads.)
 
 
  Question:  Is there already a set of improved utilities that implement
  this kind of technique?

 Not that I know of.

  If not, would this kind of performance enhancements be
  considered useful?

 It's impossible to say without knowing more.

On the (de?)merits of xargs for parallel processing:

What would you expect this to do:

    find -type f -print0 |
        xargs -0 -n 8 --max-procs=16 md5sum > ~/md5sums

    sort -k2 < md5sums > md5sums.sorted

Compared to this?

    find -type f -print0 |
        xargs -0 md5sum > ~/md5sums

    sort -k2 < md5sums > md5sums.sorted

I was a little surprised that, on my system, running in parallel (the first 
version) loses around one line of output per thousand (md5sums of 22GB of 
mostly small files).

Is there a correct way to do md5sums in parallel without having a shared 
output buffer which eats output (I presume) -- or is losing output when 
haphazardly combining output streams actually strange and unusual?
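
Perhaps the safe pattern -- a sketch; the /tmp names and the inner sh -c
trampoline are just one way to spell it -- is to give every worker its
own output file and merge afterwards:

    find -type f -print0 |
        xargs -0 -n 8 --max-procs=16 \
            sh -c 'md5sum "$@" >> "/tmp/md5.$$"' sh

    cat /tmp/md5.* | sort -k2 > md5sums.sorted
    rm -f /tmp/md5.*

but that feels like working around the shell rather than with it.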




Threaded versions of cp, mv, ls for high latency / parallel filesystems?

2008-11-08 Thread Andrew McGill
Greetings coreutils folks,

There are a number of interesting filesystems (glusterfs, lustre? ... NFS) 
which could benefit from userspace utilities doing certain operations in 
parallel.  (I have a very slow glusterfs installation that makes me think 
that some things can be done better.)

For example, copying a number of files is currently done in series ...
cp a b c d e f g h dest/
but, on certain filesystems, it would be roughly twice as efficient if 
implemented in two parallel threads, something like:
cp a c e g dest/ &
cp b d f h dest/ &
since the source and destination files can be stored on multiple physical 
volumes.  

Similarly, ls -l . will readdir(), and then stat() each file in the directory.  
On a filesystem with high latency, it would be faster to issue the stat() 
calls asynchronously, and in parallel, and then collect the results for 
display.  (This could improve performance for NFS, in proportion to the 
latency and the number of threads.)
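
A rough way to measure the potential win today, with no changes to ls
(mount point assumed; adjust for your filesystem): compare serial stat()s
from ls -l against the same stat()s issued by eight processes at once:

    cd /mnt/high-latency-fs
    time ls -l > /dev/null
    time find . -maxdepth 1 -print0 |
        xargs -0 -n 32 --max-procs=8 stat > /dev/null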


Question:  Is there already a set of improved utilities that implement this 
kind of technique?  If not, would this kind of performance enhancements be 
considered useful?  (It would mean introducing threading into programs which 
are currently single-threaded.)


To the user, it could look very much the same ...
export GNU_COREUTILS_THREADS=8
cp   # manipulate multiple files simultaneously
mv   # manipulate multiple files simultaneously
ls   # stat() multiple files simultaneously

One could also optimise the text utilities like cat by doing the open() and 
stat() operations in parallel and in the background -- userspace read-ahead 
caching.  All of the utilities which process multiple files could get 
small speed boosts from this -- rm, cat, chown, chmod ... even tail, head, 
wc -- but probably only on network filesystems.
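
A crude userspace read-ahead along those lines (a sketch; the function
name is made up): warm the client cache with background reads, then let
the real tool run against already-fetched data:

    prefetch() {
        for f in "$@"; do
            cat -- "$f" > /dev/null &   # pull each file over the wire early
        done
        wait
    }
    prefetch *.log && md5sum *.log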

:-)

