Re: POSIX most likely will require a new -C option for 'sort'
Paul Eggert [EMAIL PROTECTED] wrote:

> http://www.opengroup.org/austin/mailarchives/ag/msg10185.html
> suggests that the next edition of POSIX will require 'sort' to support
> a new -C option. There's no guarantee of this new requirement, but at
> this point I think we should probably just put in -C. Here is a
> proposed patch.
>
> 2007-01-21  Paul Eggert  [EMAIL PROTECTED]
>
>     * NEWS: New option sort -C, proposed by XCU ERN 127, which looks
>     like it will be approved. Also add --check=quiet, --check=silent
>     as long aliases, and --check=diagnose-first as an alias for -c.
>     * doc/coreutils.texi (sort invocation): Document this. Also,
>     mention that sort -c can take at most one file.
>     * src/sort.c: Implement this. Include argmatch.h.
>     (usage): Document the change.
>     (CHECK_OPTION): New constant.
>     (long_options): --check now takes an optional argument, and is
>     now treated differently from 'c'.
>     (check_args, check_types): New constant arrays.
>     (check): New arg CHECKONLY, which suppresses diagnostic if -C.
>     (main): Parse the new options.
>     * tests/sort/Test.pm (02d, 02d, incompat5, incompat6): New tests
>     for -C.

Thanks, Paul. I've applied that.

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
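As a rough illustration of the behaviour the ChangeLog describes, the sketch below runs both checks on a small unsorted file. It assumes a GNU sort new enough to include -C; the scratch file and variable names are made up for the example.

```shell
# Hypothetical scratch file with two lines in the wrong order.
tmp=$(mktemp)
printf 'b\na\n' > "$tmp"

# -c reports the first out-of-order line on stderr and exits non-zero.
status_c=0
diag=$(LC_ALL=C sort -c "$tmp" 2>&1) || status_c=$?

# -C performs the same check but prints no diagnostic.
status_C=0
quiet=$(LC_ALL=C sort -C "$tmp" 2>&1) || status_C=$?

printf 'sort -c: status %s, message: %s\n' "$status_c" "$diag"
printf 'sort -C: status %s, message: %s\n' "$status_C" "$quiet"
rm -f "$tmp"
```

Both invocations signal disorder through the exit status; only -c explains where the disorder is, which makes -C the more convenient form inside scripts.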
Re: sort behavior - Ubuntu problem?
Kevin Scannell wrote on 24-01-07 02:50:

> I suspect that the behavior I describe below is caused by broken
> locale definition files, but I wanted to get an expert opinion on this
> before I go trying to find who maintains those upstream. I know about
> the "sort does not sort" FAQ, and I don't think that I've fallen into
> that trap, so please keep reading!
>
> Anyway, here's a sample file, utf-8 encoded text.
> http://borel.slu.edu/obair/test.txt
>
>   $ uname -a
>   Linux borel 2.6.17-10-generic #2 SMP Fri Oct 13 18:45:35 UTC 2006 i686 GNU/Linux
>   $ sort --version
>   sort (GNU coreutils) 5.96
>   Copyright (C) 2006 Free Software Foundation, Inc.
>   This is free software. You may redistribute copies of it under the terms of
>   the GNU General Public License http://www.gnu.org/licenses/gpl.html.
>   There is NO WARRANTY, to the extent permitted by law.
>   Written by Mike Haertel and Paul Eggert.
>   $ locale
>   LANG=
>   LC_CTYPE=en_US.utf8
>   LC_NUMERIC=en_US.utf8
>   LC_TIME=en_US.utf8
>   LC_COLLATE=en_US.utf8
>   LC_MONETARY=en_US.utf8
>   LC_MESSAGES=en_US.utf8
>   LC_PAPER=en_US.utf8
>   LC_NAME=en_US.utf8
>   LC_ADDRESS=en_US.utf8
>   LC_TELEPHONE=en_US.utf8
>   LC_MEASUREMENT=en_US.utf8
>   LC_IDENTIFICATION=en_US.utf8
>   LC_ALL=en_US.utf8
>   $ sort test.txt
>   a á áa aá az áz áa aá
>
> The acute-a collates after the a (correctly) except when there are
> additional non-ASCII characters on the same line. I see this also with
> ga_IE.utf8 which is the locale I usually use, and the one I care
> about. This sort order is definitely wrong there.
>
> The thing that leads me to believe that the problem lies with the
> locale definition file is that on a different machine, running Gentoo,
> same conditions as above, this file sorts as I want it to, in
> dictionary order:
>
>   $ uname -a
>   Linux turing 2.6.17-gentoo-r4 #2 SMP Mon Aug 28 12:53:48 CDT 2006 x86_64 AMD Opteron(tm) Processor 246 AuthenticAMD GNU/Linux
>   $ sort test.txt
>   a á aá áa az áz aá áa
>
> Any advice would be appreciated.
>
> Kevin

My advice is to first also do sort --version on the latter machine. That might be a lead.

bjd
Re: feature request: gzip/bzip support for sort
According to Jim Meyering on 1/24/2007 12:08 AM:

> Additionally, I'm probably going to change the documentation so that
> people will be less likely to depend on being able to run a separate
> program. To be precise, I'd like to document that the only valid
> values of GNUSORT_COMPRESSOR are the empty string, gzip and bzip2[*].
> Then we will have the liberty to remove the exec calls and use library
> code instead, thus making the code a little more efficient -- but
> mainly, more robust.

Fair enough for now, as long as we leave ourselves an opening for future expansion. For example, the 7zip algorithm tends to produce smaller compressed files than even bzip2 on typical input, and is patent unencumbered, but my impression of 7zip is that it still does not behave as a filter compressor like gzip or bzip2, so it is not ready for prime-time support yet.

> If someone makes a good case for allowing an arbitrary compressor, we
> can allow that later. But if we were to add (and document) this
> feature now, we might well be stuck with it for a long time.
>
> [*] If gzip and bzip2 are good enough for tar, why should sort make
> any compromise (exec'ing some other program) in order to be more
> flexible?

New enough tar (1.16.1, for example) supports:

  --use-compress-program=PROG  filter through PROG (must accept -d)

along with the builtin recognition of gzip and bzip2. However, rather than linking in libraries, it always exec's, even for the two known formats.

-- 
Don't work too hard, make some time for fun as well!
Eric Blake [EMAIL PROTECTED]
dirname/basename bug
Hello,

These two utilities have a problem when a file name or path contains a '-' character, which is interpreted as an option. The options are a GNU addition; they are not in POSIX or the Open Group specification, which define no options for these utilities. The best would be to have no options.

Best regards,
JL
Re: dirname/basename bug
According to jl malet on 1/24/2007 3:28 AM:

> Hello,
>
> These two utilities have a problem when a file name or path contains a
> '-' character, which is interpreted as an option. The options are a
> GNU addition; they are not in POSIX or the Open Group specification,
> which define no options for these utilities. The best would be to have
> no options.
>
> Best regards

Thanks for the report. However, this is not a bug, but a misunderstanding on your part of what POSIX requires. 'dirname -- -file' is the correct way to invoke dirname on something starting with -.

POSIX does not require the support of any options, but it ALSO does not forbid any options as extensions. POSIX is quite clear that except for a few special cases (such as test and echo), users of utilities specified by POSIX must properly separate filenames from options using -- if the filename could be interpreted as an option, because implementations are allowed to add options as extensions.

-- 
Don't work too hard, make some time for fun as well!
Eric Blake [EMAIL PROTECTED]
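A minimal sketch of the invocation described above, using made-up names that begin with '-'. The variable names are only for illustration.

```shell
# "--" ends option parsing, so the operand may safely start with "-".
dir_result=$(dirname -- -dir/file)
base_result=$(basename -- -file.txt)

# printf is used instead of echo so the leading "-" is never treated
# as an option by the shell builtin.
printf '%s\n' "$dir_result"
printf '%s\n' "$base_result"
```

The same convention applies in scripts that pass arbitrary file names, for example dirname -- "$path".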
Re: sort works wrong after 117452 entries
Kreuzer IT Support wrote:

> Hello there,
>
> I have a backup script running which uses sort in order to process
> those files first which are modified last. I use find (pruning several
> paths) as input for afio, here for the filelist:
>
>   find / -path /proc -prune -o -path /tmp -prune -o -path /opt -prune -o -path /usr/src -prune -o -path /dev/pts -prune -o -path /dev/shm -prune -o -path /daten/tmp -prune -o -path /daten/backup -prune -o -path /daten/install -prune -o -path /daten/src -prune -o -printf '%p; [EMAIL PROTECTED]' | sort +1 -n -r /daten/tmp/filelist.txt
>
> You can see the result in http://www.kreuzer-it.com/filelist.txt
>
> As you will notice, the sort for column +1 does its job only until
> line 117452. Afterwards, the files are assorted.

I haven't looked at the file in detail, but a couple of suggestions:

1. Use the sort syntax as implemented in:
   http://www.pixelbeat.org/scripts/newest
2. If that doesn't sort it (pardon the pun), put a LANG=C in front of
   the sort command to try to eliminate possible locale issues.

Pádraig.
Re: sort behavior - Ubuntu problem?
On 1/24/07, Bauke Jan Douma [EMAIL PROTECTED] wrote:

> My advice is to first also do sort --version on the latter machine.
> That might be a lead.

I'm sorry for leaving that out. The second machine was running coreutils-5.96, same as the first one. More significantly, the sort order that it gives:

  a á aá áa az áz aá áa

is the order I've seen on all unix/linux machines I've used in the past, and with all versions of coreutils (I'm the Irish localizer so I've installed just about every release since 5.0).

Can anyone with a Debian-like distribution reproduce the strange sort order I'm seeing?

  a á áa aá az áz áa aá

Kevin
Re: sort works wrong after 117452 entries
Kreuzer IT Support [EMAIL PROTECTED] writes:

> As you will notice, the sort for column +1 does its job only until
> line 117452.

Which is the last line where the file name contains no space. The sort keys are obtained by splitting the line on whitespace by default.

> Afterwards, the files are assorted.

The lines are still correctly sorted on the sort key, it's just not the key you expect. Better put the sort key in the first field, or split the line on the semicolon. That makes it unambiguous.

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
And now for something completely different.
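The second suggestion, splitting on the semicolon, might look roughly like the sketch below. The sample lines and timestamps are invented for the example (the original -printf format is not fully visible in the thread), and -t and -k are used in place of the obsolete +1 key syntax.

```shell
# Made-up "path; timestamp" lines; the second path contains a space.
sample='/etc/passwd; 100
/home/user/my file.txt; 300
/var/log/syslog; 200'

# With default blank splitting, the numeric key for the middle line would
# start at " file.txt;" rather than at the timestamp.
# Splitting on ';' makes field 2 the timestamp on every line; -n skips
# the leading blank, and r reverses so the newest entry comes first.
sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort -t ';' -k 2,2nr)
printf '%s\n' "$sorted"
```

Keeping ';' as the separator also leaves the output usable with cut -d ';' afterwards.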
Re: sort works wrong after 117452 entries
Switching the columns and sorting on field +0 helped, so many thanks! I used ';' as the separator for later use with cut, but neglected to tell sort about it (I don't know if that's possible).

Thanks a lot,
niko

-----Original Message-----
From: Andreas Schwab [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 24 January 2007 15:40
To: Kreuzer IT Support
Cc: bug-coreutils@gnu.org
Subject: Re: sort works wrong after 117452 entries

Kreuzer IT Support [EMAIL PROTECTED] writes:

> As you will notice, the sort for column +1 does its job only until
> line 117452.

Which is the last line where the file name contains no space. The sort keys are obtained by splitting the line on whitespace by default.

> Afterwards, the files are assorted.

The lines are still correctly sorted on the sort key, it's just not the key you expect. Better put the sort key in the first field, or split the line on the semicolon. That makes it unambiguous.

Andreas.
Re: sort behavior - Ubuntu problem?
Kevin Scannell [EMAIL PROTECTED] writes:

> Can anyone with a Debian-like distribution reproduce the strange sort
> order I'm seeing?
>
>   a á áa aá az áz áa aá

I can't, with Debian stable x86. I get the order you expect.

  $ /usr/bin/sort test.txt
  a á aá áa az áz aá áa
  $ /usr/bin/sort --version | head -n1
  sort (coreutils) 5.2.1
  $ locale
  LANG=POSIX
  LC_CTYPE=en_US.UTF-8
  LC_NUMERIC=en_US.UTF-8
  LC_TIME=en_US.UTF-8
  LC_COLLATE=en_US.UTF-8
  LC_MONETARY=en_US.UTF-8
  LC_MESSAGES=en_US.UTF-8
  LC_PAPER=en_US.UTF-8
  LC_NAME=en_US.UTF-8
  LC_ADDRESS=en_US.UTF-8
  LC_TELEPHONE=en_US.UTF-8
  LC_MEASUREMENT=en_US.UTF-8
  LC_IDENTIFICATION=en_US.UTF-8
  LC_ALL=en_US.UTF-8

I get the same order with coreutils 6.7 as well.
Re: feature request: gzip/bzip support for sort
Jim Meyering [EMAIL PROTECTED] writes:

> I'm probably going to change the documentation so that people will be
> less likely to depend on being able to run a separate program. To be
> precise, I'd like to document that the only valid values of
> GNUSORT_COMPRESSOR are the empty string, gzip and bzip2[*].

This sounds extreme, particularly since gzip and bzip2 are not the best algorithms for 'sort' compression, where you want a fast compressor. Better choices right now would include lzop http://www.lzop.org/ and maybe QuickLZ http://www.quicklz.com/.

The fast-compressor field is moving fairly rapidly. (I've heard some rumors from some of my commercial friends.) QuickLZ, a new algorithm, is at the top of the maximumcompression list right now for fast compressors; see http://www.maximumcompression.com/data/summary_mf3.php. I would not be surprised to see a new champ next year.

> Then we will have the liberty to remove the exec calls and use library
> code instead, thus making the code a little more efficient -- but
> mainly, more robust.

It's not clear to me that it'll be more efficient for the soon-to-be common case of multicore chips, since 'sort' and the compressor can run in parallel. We'll have to measure.

I agree about the robustness, but that should be up to the user. Perhaps we could put in something that says, "If the compressor is named 'gzip' we may optimize that," and similarly for 'lzop' and/or a few other compressor names. Or, more generally, we could have the convention that if the compressor name starts with - we will strip the - and then try to optimize the result if we can. Something like that, anyway.

> [*] If gzip and bzip2 are good enough for tar, why should sort make
> any compromise (exec'ing some other program) in order to be more
> flexible?

For 'sort' the tradeoff is different than for 'tar'. We don't particularly care if the format is stable, since it's throwaway. And we want fast compression, whereas people generating tarballs often are willing to have way slower compression for a slightly higher compression ratio. (Plus, new versions of 'tar' allow arbitrary compressors anyway.)

I do have a suggestion: we shouldn't use an environment variable to select a compressor. It should just be an option. Environment variables are funny beasts and it's better to avoid them if we can. I'll construct a patch along those lines if you like.
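For reference, GNU sort releases made after this thread accept the compressor as an option, --compress-program=PROG. The sketch below only shows the syntax; it assumes such a GNU sort and an installed gzip, and the variable name is made up.

```shell
# --compress-program is a GNU sort option; gzip is assumed to be installed.
# With input this small, sort probably keeps everything in memory and
# never runs the compressor, so this only demonstrates the syntax.
sorted_numbers=$(printf '3\n1\n2\n' | LC_ALL=C sort -n --compress-program=gzip)
printf '%s\n' "$sorted_numbers"
```

Whether compression pays off depends on the input size and on how fast the chosen compressor is, which is the measurement question raised above.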
Re: dirname/basename bug
Thanks for the clarification. I did see that part of POSIX specifying the -- behaviour (I thought it was an extension of certain GNU tools). I will keep that in mind.

Best regards,
JLM

Eric Blake wrote:

> According to jl malet on 1/24/2007 3:28 AM:
>
>> Hello,
>>
>> These two utilities have a problem when a file name or path contains
>> a '-' character, which is interpreted as an option. The options are a
>> GNU addition; they are not in POSIX or the Open Group specification,
>> which define no options for these utilities. The best would be to
>> have no options.
>>
>> Best regards
>
> Thanks for the report. However, this is not a bug, but a
> misunderstanding on your part of what POSIX requires. 'dirname --
> -file' is the correct way to invoke dirname on something starting
> with -.
>
> POSIX does not require the support of any options, but it ALSO does
> not forbid any options as extensions. POSIX is quite clear that except
> for a few special cases (such as test and echo), users of utilities
> specified by POSIX must properly separate filenames from options using
> -- if the filename could be interpreted as an option, because
> implementations are allowed to add options as extensions.
Re: sort behavior - Ubuntu problem?
On 1/24/07, The Wanderer [EMAIL PROTECTED] wrote:

> Paul Eggert wrote:
>
>> Kevin Scannell [EMAIL PROTECTED] writes:
>>
>>> Can anyone with a Debian-like distribution reproduce the strange
>>> sort order I'm seeing?
>>>
>>>   a á áa aá az áz áa aá
>>
>> I can't, with Debian stable x86. I get the order you expect.
>
> I can, with a mix of Debian unstable and testing, also x86.

Paul, Wanderer,

I'm grateful for the tests. At least now I know I'm not totally crazy.

I did a bit more testing this afternoon. First, I found the same problem with later versions of coreutils, including 6.7. Then, since I suspected a locale definition bug, I copied the locale source file that defines LC_COLLATE (/usr/share/i18n/locales/iso14651_t1) from my Gentoo box where sort works, to the broken Ubuntu box and reran locale-gen. There were quite a few differences between the files so I was hopeful that this might do the trick, but unfortunately it didn't help.

One thing I can say for sure is that strcoll is broken. I wrote a 10 line C program that sets up the utf-8 strings aá (0x61,0xc3,0xa1) and áa (0xc3,0xa1,0x61) explicitly and then outputs the return value of strcoll. It definitely thinks áa should be first, which is bad. Browsing around the sort source code, strcoll seems to be the heart of the matter (by way of xmemcoll and memcoll) - please correct me if I'm wrong.

Wanderer, could you tell me what version of glibc you have? Here's mine:

  ii libc6-dev 2.4-1ubuntu12 GNU C Library: Development Libraries and Hea

Thanks again for the help - I'll try and sort it out with the glibc developers, or maybe by looking carefully at the recent Debian/Ubuntu patches.

-Kevin
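The C program itself is not included in the thread. A rough shell equivalent, which exercises the locale's collation through sort instead of calling strcoll directly, could look like the sketch below. It assumes an en_US.UTF-8 locale may or may not be installed; the variable names are made up.

```shell
# The two strings from the message, written as explicit UTF-8 bytes
# using octal escapes (0x61 = \141, 0xc3 = \303, 0xa1 = \241).
a_then_acute='\141\303\241'
acute_then_a='\303\241\141'

# Byte order (C locale): 0x61 sorts before 0xc3, so "aá" comes first.
first_c=$(printf "$acute_then_a\n$a_then_acute\n" | env LC_ALL=C sort | head -n 1)

# Locale order: depends on the installed en_US.UTF-8 collation data.
# If that locale is missing, sort may simply fall back to byte order.
first_loc=$(printf "$acute_then_a\n$a_then_acute\n" | env LC_ALL=en_US.UTF-8 sort | head -n 1)

printf 'C locale sorts first:     %s\n' "$first_c"
printf 'en_US.UTF-8 sorts first:  %s\n' "$first_loc"
```

On the affected machine, the second line would be expected to show áa, matching the strcoll result reported above.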
Re: feature request: gzip/bzip support for sort
Firstly, I wanted to say that I am excited by the extremely fast progress that has been made in sort for compressed temporary files. Many thanks to Dan and others for the implementation. (I've failed to accomplish the bootstrap of the CVS sources - are there bootstrapped snapshots available anywhere?)

> For 'sort' the tradeoff is different than for 'tar'. We don't
> particularly care if the format is stable, since it's throwaway. And
> we want fast compression, whereas people generating tarballs often are
> willing to have way slower compression for a slightly higher
> compression ratio. (Plus, new versions of 'tar' allow arbitrary
> compressors anyway.)

Now that we have the ability to fork decompression processes, are we likely to see sort have the ability to open gzipped (or bzip2ed) files? For sorting a stream of compressed data, this is obviously not required, but for merging, this would avoid a substantial mess of zcats feeding fifos and so on. However, I'd understand if it was decided not to, because unlike the temporary files, there is an existing workable solution.

Many thanks,
Craig Macdonald
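The existing workaround mentioned above, decompressors feeding fifos that sort then merges, can be sketched as follows. The file names and contents are invented for the example, and gzip is assumed to be installed.

```shell
# Hypothetical scratch directory with two pre-sorted, gzipped inputs.
workdir=$(mktemp -d)
printf 'apple\ncherry\n' | gzip -c > "$workdir/a.gz"
printf 'banana\ndate\n'  | gzip -c > "$workdir/b.gz"

# One fifo per compressed input; each decompressor writes in the background.
mkfifo "$workdir/a.fifo" "$workdir/b.fifo"
gzip -dc "$workdir/a.gz" > "$workdir/a.fifo" &
gzip -dc "$workdir/b.gz" > "$workdir/b.fifo" &

# sort -m merges the already-sorted streams without re-sorting them.
merged=$(LC_ALL=C sort -m "$workdir/a.fifo" "$workdir/b.fifo")
wait
printf '%s\n' "$merged"
rm -rf "$workdir"
```

This works, but the fifo bookkeeping grows with the number of inputs, which is the mess the request is trying to avoid.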