Feature request: handling space-delimited data with cut
A novice will cut out a list of PIDs like this:

    ps uaxw | grep nobody | cut -f 2          # huh?
    ps uaxw | grep nobody | cut -f 2 -d ' '   # oh, not tabs, but spaces.  huh?
    ps uaxw | grep nobody | cut -f 4 -d ' '   # 'nobody' is followed by 3 spaces

After consulting 3 friends and 6 linux mailing lists about that one random
line it prints out, they learn that the canonical ways of doing this simple
task are all just horrid:

    ps uaxw | grep nobody | tr -s ' ' | cut -f 2 -d ' '
    ps uaxw | grep nobody | awk '{print $2}'
    ps uaxw | grep nobody | while read x y z; do echo $y; done

There are two problems:

  * Input is not tab delimited.  It might once have been, but currently
    the most interesting input is space delimited.
  * There is no SIMPLE tool for handling space-delimited data (awk is not
    simple, IMHO).

There are a few ways of fixing this:

1. Adding an option (-w), similar to -f, to cut words separated by
   whitespace (same rules as sort, unless the whitespace is changed to
   something else with -d):

    ps uaxwf | grep nobody | cut -w 2
    ps uaxwf | grep nobody | cut --words 2

2. Adding a switch (-w) to set the delimiter to multiple whitespace:

    ps uaxwf | grep nobody | cut -w -f 2
    ps uaxwf | grep nobody | cut --whitespace -f 2

3. Adding an option (-m) to merge delimiters, similar to what tr -s does,
   but without having to specify the delimiter twice on the command line:

    ps uaxwf | grep nobody | cut -m -d ' ' -f 2
    ps uaxwf | grep nobody | cut --merge-delimiters -d ' ' -f 2

4. Adding out-of-band meta information about the input stream, thereby
   violating everything unixy, and condemning generations of children not
   to understand that the scissors they are running with are sharp:

    ps uaxwf | grep nobody | cut -f 2

   It could be called Alternate Data Streams and Extended Attributes.

5. Embarking on a campaign of education and training, so that the true
   path to awk and tr enlightenment will be known to all.

I'm in favour of cut -w ... can I send a patch?  :-)
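(Until something like that exists, a rough stand-in for the proposed cut -w
can be had from awk, which already splits on runs of blanks by default.
The function name cutw below is made up for illustration:

    # cutw N -- approximate the proposed 'cut -w N': split each input
    # line on runs of whitespace and print field N
    cutw() { awk -v n="$1" '{ print $n }'; }

    ps uaxw | grep nobody | cutw 2

Not simple enough to spare the novice the trip to the manual, which is
rather the point of the feature request.)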
Re: option abbreviation exceptions
On Tuesday 30 December 2008 15:00:18 Eric Blake wrote:
> According to Pádraig Brady on 12/30/2008 2:46 AM:
> > Usage: truncate [OPTION]... [FILE]...
> >
> > Is supporting stdin a useful enhancement?
>
> er ...

Maybe if you can get the shell to open different files based on some
condition, though again that seems a little contrived:

    if cond ; then
        foo=file1
    else
        foo=file2
    fi
    truncate -s0 <$foo

This < redirection is wonderful, but entirely counter-intuitive.  By
convention stdout is where writes occur, stdin is where reads occur.
Modifying the file given as stdin is just a little unexpected.

For good measure, (all?) shells open stdin as read-only, which makes the
operation fail -- ftruncate(0,0) gives "invalid argument".  The
redirection you need for a writable stdin under bash seems to be this one:

    truncate -s$SIZE 0<>$foo

:-)
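(The read-only stdin problem is easy to reproduce without truncate at all.
A minimal illustration under bash, using a scratch file f and plain echo
in place of the hypothetical stdin-aware truncate; the failing write is
the same class of error ftruncate would hit:

    echo data > f
    { echo DATA >&0; } < f      # fd 0 opened O_RDONLY: the write fails
    { echo DATA >&0; } 0<> f    # <> opens fd 0 read-write: the write succeeds
    cat f                       # now prints DATA

)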
Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?
On Thursday 13 November 2008 14:52:44 Ralf Wildenhues wrote:
> Hello Andrew,
>
> Andrew McGill list2008 at lunch.za.net writes:
> > find -type f -print0 | xargs -0 -n 8 --max-procs=16 md5sum > ~/md5sums
> > sort -k2 md5sums > md5sums.sorted
>
> To avoid losing output, use append mode for writing:
>
>     : > ~/md5sums
>     find -type f -print0 | xargs -0 -n 8 --max-procs=16 md5sum >> ~/md5sums 2>&1
>     sort -k2 md5sums > md5sums.sorted
>
> This just recently came up in Autoconf:
> http://thread.gmane.org/gmane.comp.shells.bash.bugs/11958

Ah!  I see!  So without O_APPEND, things don't work quite right.

At the risk of drifting off topic -- is there ever a benefit in the shell
implementing a > redirection with just O_TRUNC, rather than
O_TRUNC | O_APPEND?  Does the output process ever need to seek() back in
stdout?

(If this is off topic, please feel free to flame me, and/or direct me to
the correct forum -- but I did freely send a bug report to the bash folks,
even though I'll bet they're not alone in omitting O_APPEND with O_TRUNC.)

:-)
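(There is at least one everyday case where the output process does seek in
stdout, which is presumably why shells keep plain O_TRUNC: dd with seek=
repositions fd 1 before writing, and O_APPEND would silently defeat that.
A small demonstration, assuming GNU dd and made-up output file names:

    dd if=/dev/zero bs=1 count=1 seek=9 > ten.bin 2>/dev/null
    wc -c < ten.bin    # 10: dd positioned fd 1 at offset 9, then wrote 1 byte

    dd if=/dev/zero bs=1 count=1 seek=9 >> one.bin 2>/dev/null
    wc -c < one.bin    # 1: with O_APPEND every write lands at end-of-file

)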
Re: Threaded versions of cp, mv, ls for high latency / parallel filesystems?
On Saturday 08 November 2008 20:05:25 Jim Meyering wrote:
> Andrew McGill [EMAIL PROTECTED] wrote:
> > Greetings coreutils folks,
> >
> > There are a number of interesting filesystems (glusterfs, lustre? ...
> > NFS) which could benefit from userspace utilities doing certain
> > operations in parallel.  (I have a very slow glusterfs installation
> > that makes me think that some things can be done better.)
> >
> > For example, copying a number of files is currently done in series ...
> >     cp a b c d e f g h dest/
> > but, on certain filesystems, it would be roughly twice as efficient if
> > implemented in two parallel threads, something like:
> >     cp a c e g dest/
> >     cp b d f h dest/
> > since the source and destination files can be stored on multiple
> > physical volumes.
>
> How about parallelizing it via xargs, e.g.,
>
>     $ echo a b c d e f g h | xargs -t -n4 --no-run-if-empty \
>         --max-procs=2 -- cp --target-directory=dest
>     cp --target-directory=dest a b c d
>     cp --target-directory=dest e f g h
>
> Obviously the above is tailored (-L4) to your 8-input example.
> In practice, you'd use a larger number, unless latency is so high as to
> dwarf the cost of extra fork/exec syscalls, in which case even -L1
> might make sense.

I did the command above with md5sum as the command, and got missing lines
in the output.  I optimistically hoped that would not happen!

> mv and ln also accept the --target-directory=dest option.
>
> > Similarly, ls -l . will readdir(), and then stat() each file in the
> > directory.  On a filesystem with high latency, it would be faster to
> > issue the stat() calls asynchronously, and in parallel, and then
> > collect the results for display.  (This could improve performance for
> > NFS, in proportion to the latency and the number of threads.)
>
> If you can demonstrate a large performance gain on systems that many
> people use, then maybe...  There is more than a little value in keeping
> programs like those in the coreutils package relatively simple, but if
> the cost(maintenance+portability burden)/benefit ratio is low enough,
> then anything is possible.  For example, a well-encapsulated,
> optionally-threaded stat_all_dir_entries API might be useful in some
> situations.  So a relatively small change for parallel stat() in ls
> could fly.
>
> If getting any eventual patch into upstream coreutils is important to
> you, be sure there is some consensus on this list before doing a lot of
> work on it.

Any ideas on how to do a parallel cp / mv in a way that is not Considered
Harmful?  Maybe prefetch_files(max_bytes,file1,...,NULL) ... aargh.

> > Question: Is there already a set of improved utilities that implement
> > this kind of technique?
>
> Not that I know of.
>
> > If not, would this kind of performance enhancement be considered
> > useful?
>
> It's impossible to say without knowing more.

On the (de?)merits of xargs for parallel processing:  What would you
expect this to do --

    find -type f -print0 | xargs -0 -n 8 --max-procs=16 md5sum > ~/md5sums
    sort -k2 md5sums > md5sums.sorted

Compared to this?

    find -type f -print0 | xargs -0 md5sum > ~/md5sums
    sort -k2 md5sums > md5sums.sorted

I was a little surprised that on my system running in parallel (the first
version) loses around 1 line of output per thousand (md5sum of 22Gb in
mostly small files).  Is there a correct way to do md5sums in parallel
without having a shared output buffer which eats output (I presume) -- or
is losing output when haphazardly combining output streams actually
strange and unusual?
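(For that last question, one pattern that sidesteps the shared output
descriptor entirely -- a sketch, with the md5.XXXXXX temp-file naming
invented here -- is to give each batch its own file and merge afterwards:

    # every md5sum batch writes to a private file created by mktemp, so
    # no two processes ever write through the same descriptor
    find . -type f -print0 |
      xargs -0 -n 8 --max-procs=16 sh -c 'md5sum "$@" > "$(mktemp ./md5.XXXXXX)"' sh
    cat md5.?????? | sort -k2 > md5sums.sorted
    rm -f md5.??????

)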
Threaded versions of cp, mv, ls for high latency / parallel filesystems?
Greetings coreutils folks,

There are a number of interesting filesystems (glusterfs, lustre? ... NFS)
which could benefit from userspace utilities doing certain operations in
parallel.  (I have a very slow glusterfs installation that makes me think
that some things can be done better.)

For example, copying a number of files is currently done in series ...

    cp a b c d e f g h dest/

but, on certain filesystems, it would be roughly twice as efficient if
implemented in two parallel threads, something like:

    cp a c e g dest/
    cp b d f h dest/

since the source and destination files can be stored on multiple physical
volumes.

Similarly, ls -l . will readdir(), and then stat() each file in the
directory.  On a filesystem with high latency, it would be faster to
issue the stat() calls asynchronously, and in parallel, and then collect
the results for display.  (This could improve performance for NFS, in
proportion to the latency and the number of threads.)

Question: Is there already a set of improved utilities that implement
this kind of technique?  If not, would this kind of performance
enhancement be considered useful?  (It would mean introducing threading
into programs which are currently single-threaded.)

To the user, it could look very much the same ...

    export GNU_COREUTILS_THREADS=8
    cp    # manipulate multiple files simultaneously
    mv    # manipulate multiple files simultaneously
    ls    # stat() multiple files simultaneously

One could also optimise the text utilities like cat by doing the open()
and stat() operations in parallel and in the background -- userspace
read-ahead caching.  All of the utilities which process multiple files
could get small speed boosts from this -- rm, cat, chown, chmod ... even
tail, head, wc -- but probably only on network filesystems.

:-)
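(The two-process split sketched above can be approximated today from the
shell -- a rough illustration, assuming GNU cp's -t option and filenames
free of whitespace, not a substitute for real threading:

    # round-robin the arguments into two background cp processes
    printf '%s\n' a b c d e f g h | awk 'NR % 2 == 1' | xargs cp -t dest/ &
    printf '%s\n' a b c d e f g h | awk 'NR % 2 == 0' | xargs cp -t dest/ &
    wait

)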