Re: POSIX most likely will require a new -C option for 'sort'

2007-01-24 Thread Jim Meyering
Paul Eggert [EMAIL PROTECTED] wrote:
 http://www.opengroup.org/austin/mailarchives/ag/msg10185.html
 suggests that the next edition of POSIX will require 'sort' to support
 a new -C option.  There's no guarantee of this new requirement, but at
 this point I think we should probably just put in -C.  Here is a
 proposed patch.

 2007-01-21  Paul Eggert  [EMAIL PROTECTED]

   * NEWS: New option sort -C, proposed by XCU ERN 127, which looks
   like it will be approved.  Also add --check=quiet, --check=silent
   as long aliases, and --check=diagnose-first as an alias for -c.
   * doc/coreutils.texi (sort invocation): Document this.
   Also, mention that sort -c can take at most one file.
   * src/sort.c: Implement this.
   Include argmatch.h.
   (usage): Document the change.
   (CHECK_OPTION): New constant.
   (long_options): --check now takes an optional argument, and is now
   treated differently from 'c'.
   (check_args, check_types): New constant arrays.
   (check): New arg CHECKONLY, which suppresses diagnostic if -C.
   (main): Parse the new options.
   * tests/sort/Test.pm (02d, 02d, incompat5, incompat6):
   New tests for -C.

Thanks, Paul.
I've applied that.
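
A minimal sketch of how the two checking modes differ, assuming a GNU sort new enough to include the -C option from the patch above. The file names are made up for illustration.

```shell
#!/bin/sh
# Compare 'sort -c' (diagnoses the first disorder) with 'sort -C' (silent).
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmpdir/sorted.txt"
printf 'b\na\nc\n' > "$tmpdir/unsorted.txt"

# -c writes a diagnostic to stderr and exits non-zero on unsorted input.
loud_status=0
loud_msg=$(LC_ALL=C sort -c "$tmpdir/unsorted.txt" 2>&1) || loud_status=$?

# -C reports the result only through its exit status.
quiet_status=0
quiet_msg=$(LC_ALL=C sort -C "$tmpdir/unsorted.txt" 2>&1) || quiet_status=$?

# Already-sorted input makes both modes exit with status 0.
sorted_status=0
LC_ALL=C sort -C "$tmpdir/sorted.txt" || sorted_status=$?

echo "-c on unsorted input: status=$loud_status message='$loud_msg'"
echo "-C on unsorted input: status=$quiet_status message='$quiet_msg'"
echo "-C on sorted input:   status=$sorted_status"
rm -rf "$tmpdir"
```

In a script, -C is the convenient form for a test such as "if sort -C file; then ...", since nothing needs to be redirected away.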


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: sort behavior - Ubuntu problem?

2007-01-24 Thread Bauke Jan Douma

Kevin Scannell wrote on 24-01-07 02:50:

I suspect that the behavior I describe below is caused by broken
locale definition files, but I wanted to get an expert opinion on this
before I go trying to find who maintains those upstream.

I know about the "sort does not sort" FAQ, and I don't think that I've
fallen into that trap, so please keep reading!

Anyway, here's a sample file, utf-8 encoded text.
http://borel.slu.edu/obair/test.txt

$ uname -a
Linux borel 2.6.17-10-generic #2 SMP Fri Oct 13 18:45:35 UTC 2006 i686 GNU/Linux


$ sort --version
sort (GNU coreutils) 5.96
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License http://www.gnu.org/licenses/gpl.html.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

$ locale
LANG=
LC_CTYPE=en_US.utf8
LC_NUMERIC=en_US.utf8
LC_TIME=en_US.utf8
LC_COLLATE=en_US.utf8
LC_MONETARY=en_US.utf8
LC_MESSAGES=en_US.utf8
LC_PAPER=en_US.utf8
LC_NAME=en_US.utf8
LC_ADDRESS=en_US.utf8
LC_TELEPHONE=en_US.utf8
LC_MEASUREMENT=en_US.utf8
LC_IDENTIFICATION=en_US.utf8
LC_ALL=en_US.utf8

$ sort test.txt
a
á
áa
aá
az
áz
áa
aá

The acute-a collates after the a (correctly) except when there are
additional non-ASCII characters on the same line.   I see this also
with ga_IE.utf8 which is the locale I usually use, and the one I care
about.  This sort order is definitely wrong there.

The thing that leads me to believe that the problem lies with the
locale definition file is that on a different machine, running Gentoo,
same conditions as above, this file sorts as I want it to, in
dictionary order:

$ uname -a
Linux turing 2.6.17-gentoo-r4 #2 SMP Mon Aug 28 12:53:48 CDT 2006
x86_64 AMD Opteron(tm) Processor 246 AuthenticAMD GNU/Linux

$ sort test.txt
a
á
aá
áa
az
áz
aá
áa

Any advice would be appreciated.
Kevin



My advice is to first run sort --version on the latter machine as well.
That might be a lead.

bjd





Re: feature request: gzip/bzip support for sort

2007-01-24 Thread Eric Blake

According to Jim Meyering on 1/24/2007 12:08 AM:
 Additionally, I'm probably going to change the documentation so that
 people will be less likely to depend on being able to run a separate
 program.  To be precise, I'd like to document that the only valid values
 of GNUSORT_COMPRESSOR are the empty string, gzip and bzip2[*].
 Then we will have the liberty to remove the exec calls and use library
 code instead, thus making the code a little more efficient -- but mainly,
 more robust.

Fair enough for now, as long as we leave ourselves an opening for future
expansion.  For example, the 7zip algorithm tends to produce smaller
compressed files than even bzip2 on typical input, and is
patent-unencumbered, but my impression of 7zip is that it still does not
behave as a filter compressor like gzip or bzip2, so it is not ready for
prime-time support yet.

 
 If someone makes a good case for allowing an arbitrary compressor, we can
 allow that later.  But if we were to add (and document) this feature now,
 we might well be stuck with it for a long time.
 
 [*] If gzip and bzip2 are good enough for tar, why should sort make any
 compromise (exec'ing some other program) in order to be more flexible?

New enough tar (1.16.1, for example) supports:
  --use-compress-program=PROG
        filter through PROG (must accept -d)
along with the builtin recognition of gzip and bzip2.  However, rather
than linking in libraries, it always exec's, even for the known two formats.
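
A small sketch of the tar option mentioned above, assuming GNU tar and gzip are installed. The directory and archive names are hypothetical.

```shell
#!/bin/sh
# Create and list an archive through an externally exec'd compressor.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/src"
echo hello > "$tmpdir/src/greeting.txt"

# tar runs PROG to compress when creating the archive...
tar --use-compress-program=gzip -cf "$tmpdir/src.tar.gz" -C "$tmpdir" src

# ...and runs "PROG -d" to decompress when reading it back.
listing=$(tar --use-compress-program=gzip -tf "$tmpdir/src.tar.gz")
printf '%s\n' "$listing"
rm -rf "$tmpdir"
```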

--
Don't work too hard, make some time for fun as well!

Eric Blake [EMAIL PROTECTED]




dirname/basename bug

2007-01-24 Thread jl malet

hello
These two utilities have a problem when a file name or path contains a '-'
character, which is interpreted as an option.
The options are a GNU addition; they are not in the POSIX or Open Group
specification, which defines no options.  It would be best to have no options.

best regards
JL




Re: dirname/basename bug

2007-01-24 Thread Eric Blake

According to jl malet on 1/24/2007 3:28 AM:
 hello
 These two utilities have a problem when a file name or path contains a '-'
 character, which is interpreted as an option.
 The options are a GNU addition; they are not in the POSIX or Open Group
 specification, which defines no options.  It would be best to have no options.
 best regards

Thanks for the report.  However, this is not a bug, but a misunderstanding
on your part of what POSIX requires.

'dirname -- -file' is the correct way to invoke dirname on something
starting with -.  POSIX does not require the support of any options, but
it ALSO does not forbid any options as extensions.  POSIX is quite clear
that except for a few special cases (such as test and echo), users of
utilities specified by POSIX must properly separate filenames from options
using -- if the filename could be interpreted as an option, because
implementations are allowed to add options as extensions.
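
For illustration, a short sketch of the "--" convention described above. The operand names are invented, and the commands only manipulate strings, so no such files need to exist.

```shell
#!/bin/sh
# '--' ends option parsing, so operands that start with '-' are taken literally.
dir_part=$(dirname -- -file)
base_part=$(basename -- -file)
nested_dir=$(dirname -- ./-dir/-file)

echo "dirname -- -file        -> $dir_part"
echo "basename -- -file       -> $base_part"
echo "dirname -- ./-dir/-file -> $nested_dir"
```

Prefixing a relative name with "./" is another common way to keep it from being read as an option, though it changes the string that dirname reports.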

--
Don't work too hard, make some time for fun as well!

Eric Blake [EMAIL PROTECTED]




Re: sort works wrong after 117452 entries

2007-01-24 Thread Pádraig Brady
Kreuzer IT Support wrote:
 Hello there, 
  
 I have a backup script running which uses sort in order to process those
 files first which are modified last. 
  
 I use find (pruning several paths) as input for afio, here for the filelist:
 
  
  find / -path /proc -prune -o -path /tmp -prune -o -path /opt -prune -o
 -path /usr/src -prune -o -path /dev/pts -prune -o -path /dev/shm -prune -o
 -path /daten/tmp -prune -o -path /daten/backup -prune -o -path
 /daten/install -prune -o -path /daten/src -prune -o -printf '%p; [EMAIL PROTECTED]' |
 sort +1 -n -r > /daten/tmp/filelist.txt
 
 You can see the result in http://www.kreuzer-it.com/filelist.txt 
  
 As you will notice, the sort for column +1 does its job only until line
 117452. Afterwards, the files are assorted. 

I haven't looked at the file in detail, but a couple of suggestions:

1. Use the sort syntax as implemented in:
http://www.pixelbeat.org/scripts/newest

2. If that doesn't sort it (pardon the pun)
put a LANG=C in front of the sort command to
try to eliminate possible locale issues.
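
A minimal sketch of the second suggestion. One assumption beyond the text above: it sets LC_ALL rather than LANG, because LC_ALL, when set, takes precedence over LANG and the individual LC_* variables. The sample lines are invented.

```shell
#!/bin/sh
# Force byte-order comparison for a single sort invocation.
# In the C locale every uppercase ASCII letter sorts before every lowercase one.
c_order=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_order"
```

If the output of a real job changes when run this way, the locale's collation rules are involved. If it does not change, the cause lies elsewhere, for example in how the sort key is selected.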

Pádraig.




Re: sort behavior - Ubuntu problem?

2007-01-24 Thread Kevin Scannell

On 1/24/07, Bauke Jan Douma [EMAIL PROTECTED] wrote:

My advice is to first also do sort --version on the latter machine.
That might be a lead.



I'm sorry for leaving that out.  The second machine was running
coreutils-5.96, same as the first one.  More significantly, the sort
order that it gives:
a
á
aá
áa
az
áz
aá
áa

is the order I've seen on all unix/linux machines I've used in the
past, and with all versions of coreutils (I'm the Irish localizer so
I've installed just about every release since 5.0).

Can anyone with a Debian-like distribution reproduce the strange sort
order I'm seeing?
a
á
áa
aá
az
áz
áa
aá

Kevin


Re: sort works wrong after 117452 entries

2007-01-24 Thread Andreas Schwab
Kreuzer IT Support [EMAIL PROTECTED] writes:

 As you will notice, the sort for column +1 does its job only until line
 117452.

Which is the last line where the file name contains no space.  The sort
keys are obtained by splitting the line on whitespace by default.

 Afterwards, the files are assorted. 

The lines are still correctly sorted on the sort key, it's just not the
key you expect.  Better put the sort key in the first field, or split the
line on the semicolon.  That makes it unambiguous.
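
A sketch of the difference, using three invented lines in the same "name; number" layout as the report. The paths and numbers are made up for illustration.

```shell
#!/bin/sh
# Three sample records; the middle one has a space in its file name.
list=$(printf '%s\n' '/a/one; 100' '/b/two words; 300' '/c/three; 200')

# Default splitting on whitespace: for the middle line, field 2 is " words;",
# which compares as the number 0, so that line lands last.
by_whitespace=$(printf '%s\n' "$list" | LC_ALL=C sort -k 2,2nr)

# Splitting on ';' makes field 2 the number on every line.
by_semicolon=$(printf '%s\n' "$list" | LC_ALL=C sort -t ';' -k 2,2nr)

echo "whitespace-separated key:"
printf '%s\n' "$by_whitespace"
echo "semicolon-separated key:"
printf '%s\n' "$by_semicolon"
```

The -k form is used here in place of the obsolescent "+1" syntax from the report.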

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.




Re: sort works wrong after 117452 entries

2007-01-24 Thread Kreuzer IT Support

Switching the columns and sorting on field +0 helped,
so many thanks!

I used ';' as the delimiter for later use with cut, but neglected to tell
sort about it (I don't know if that's possible).
Thanks a lot,
niko

-Original Message-
From: Andreas Schwab [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 24 January 2007 15:40
To: Kreuzer IT Support
Cc: bug-coreutils@gnu.org
Subject: Re: sort works wrong after 117452 entries

Kreuzer IT Support [EMAIL PROTECTED] writes:

 As you will notice, the sort for column +1 does its job only until 
 line 117452.

Which is the last line where the file name contains no space.  The sort keys
are obtained by splitting the line on whitespace by default.

 Afterwards, the files are assorted. 

The lines are still correctly sorted on the sort key, it's just not the key
you expect.  Better put the sort key in the first field, or split the line
on the semicolon.  That makes it unambiguous.

Andreas.

--
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.






Re: sort behavior - Ubuntu problem?

2007-01-24 Thread Paul Eggert
Kevin Scannell [EMAIL PROTECTED] writes:

 Can anyone with a Debian-like distribution reproduce the strange sort
 order I'm seeing?
 a
 á
 áa
 aá
 az
 áz
 áa
 aá

I can't, with Debian stable x86.  I get the order you expect.

   $ /usr/bin/sort test.txt
   a
   á
   aá
   áa
   az
   áz
   aá
   áa
   $ /usr/bin/sort --version | head -n1
   sort (coreutils) 5.2.1
   $ locale
   LANG=POSIX
   LC_CTYPE=en_US.UTF-8
   LC_NUMERIC=en_US.UTF-8
   LC_TIME=en_US.UTF-8
   LC_COLLATE=en_US.UTF-8
   LC_MONETARY=en_US.UTF-8
   LC_MESSAGES=en_US.UTF-8
   LC_PAPER=en_US.UTF-8
   LC_NAME=en_US.UTF-8
   LC_ADDRESS=en_US.UTF-8
   LC_TELEPHONE=en_US.UTF-8
   LC_MEASUREMENT=en_US.UTF-8
   LC_IDENTIFICATION=en_US.UTF-8
   LC_ALL=en_US.UTF-8

I get the same order with coreutils 6.7 as well.




Re: feature request: gzip/bzip support for sort

2007-01-24 Thread Paul Eggert
Jim Meyering [EMAIL PROTECTED] writes:

 I'm probably going to change the documentation so that
 people will be less likely to depend on being able to run
 a separate program.  To be precise, I'd like to document
 that the only valid values of GNUSORT_COMPRESSOR are the
 empty string, gzip and bzip2[*].

This sounds extreme, particularly since gzip and bzip2 are
not the best algorithms for 'sort' compression, where you
want a fast compressor.  Better choices right now would
include lzop http://www.lzop.org/ and maybe
QuickLZ http://www.quicklz.com/.

The fast-compressor field is moving fairly rapidly.
(I've heard some rumors from some of my commercial friends.)
QuickLZ, a new algorithm, is at the top of the
maximumcompression list right now for fast compressors; see
http://www.maximumcompression.com/data/summary_mf3.php.
I would not be surprised to see a new champ next year.

 Then we will have the liberty to remove the exec calls and use library
 code instead, thus making the code a little more efficient -- but mainly,
 more robust.

It's not clear to me that it'll be more efficient for the
soon-to-be common case of multicore chips, since 'sort' and
the compressor can run in parallel.  We'll have to measure.
I agree about the robustness, but that should be up to the user.

Perhaps we could put in something that says, "If the
compressor is named 'gzip' we may optimize that." and
similarly for 'lzop' and/or a few other compressor names.
Or, more generally, we could have the convention that if the
compressor name starts with "-" we will strip the "-" and
then try to optimize the result if we can.  Something like
that, anyway.

 [*] If gzip and bzip2 are good enough for tar, why should sort make any
 compromise (exec'ing some other program) in order to be more flexible?

For 'sort' the tradeoff is different than for 'tar'.  We
don't particularly care if the format is stable, since it's
throwaway.  And we want fast compression, whereas people
generating tarballs often are willing to have way slower
compression for a slightly higher compression ratio.  (Plus,
new versions of 'tar' allow arbitrary compressors anyway.)


I do have a suggestion: we shouldn't use an environment
variable to select a compressor.  It should just be an
option.  Environment variables are funny beasts and it's
better to avoid them if we can.  I'll construct a patch
along those lines if you like.
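
For readers who want to try the option-based form: a sketch assuming a GNU sort that provides --compress-program=PROG, which is present in later GNU sort releases (this thread predates it), and an installed gzip. The input is generated, and the buffer size is kept deliberately small so that temporary files are actually needed.

```shell
#!/bin/sh
# Sort a reversed number list with compressed temporary files.
tmpdir=$(mktemp -d)
seq 200000 -1 1 > "$tmpdir/input.txt"

# -S limits the in-memory buffer, -T says where the temporary files go,
# and --compress-program names the filter used to compress them.
sort -n -S 100K -T "$tmpdir" --compress-program=gzip \
    "$tmpdir/input.txt" > "$tmpdir/output.txt"

first_line=$(head -n 1 "$tmpdir/output.txt")
last_line=$(tail -n 1 "$tmpdir/output.txt")
line_count=$(wc -l < "$tmpdir/output.txt")
echo "first=$first_line last=$last_line lines=$line_count"
rm -rf "$tmpdir"
```

Swapping gzip for a faster filter such as lzop is a one-word change, which makes it easy to measure the tradeoff discussed above.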




Re: dirname/basename bug

2007-01-24 Thread jl malet
Thanks for the clarification.  I had not seen the part of POSIX that
specifies the -- behaviour (I thought it was an extension of certain GNU tools).

I'll keep that in mind.
best regards
JLM

Eric Blake wrote:


According to jl malet on 1/24/2007 3:28 AM:
  

hello
These two utilities have a problem when a file name or path contains a '-'
character, which is interpreted as an option.
The options are a GNU addition; they are not in the POSIX or Open Group
specification, which defines no options.  It would be best to have no options.
best regards



Thanks for the report.  However, this is not a bug, but a misunderstanding
on your part of what POSIX requires.

'dirname -- -file' is the correct way to invoke dirname on something
starting with -.  POSIX does not require the support of any options, but
it ALSO does not forbid any options as extensions.  POSIX is quite clear
that except for a few special cases (such as test and echo), users of
utilities specified by POSIX must properly separate filenames from options
using -- if the filename could be interpreted as an option, because
implementations are allowed to add options as extensions.

--
Don't work too hard, make some time for fun as well!

Eric Blake [EMAIL PROTECTED]
  






Re: sort behavior - Ubuntu problem?

2007-01-24 Thread Kevin Scannell

On 1/24/07, The Wanderer [EMAIL PROTECTED] wrote:

Paul Eggert wrote:

 Kevin Scannell [EMAIL PROTECTED] writes:

 Can anyone with a Debian-like distribution reproduce the strange
 sort order I'm seeing?
 a
 á
 áa
 aá
 az
 áz
 áa
 aá

 I can't, with Debian stable x86.  I get the order you expect.

I can, with a mix of Debian unstable and testing, also x86.



Paul, Wanderer,
I'm grateful for the tests.  At least now I know I'm not totally
crazy.  I did a bit more testing this afternoon.
First, I found the same problem with later versions of coreutils,
including 6.7.  Then, since I suspected a locale definition bug, I
copied the locale source file that defines LC_COLLATE
(/usr/share/i18n/locales/iso14651_t1) from my Gentoo box, where sort
works, to the broken Ubuntu box and reran locale-gen.  There were
quite a few differences between the files so I was hopeful that this
might do the trick, but unfortunately it didn't help.

One thing I can say for sure is that strcoll is broken.  I wrote a
10-line C program that sets up the UTF-8 strings "aá" (0x61,0xc3,0xa1)
and "áa" (0xc3,0xa1,0x61) explicitly and then outputs the return value
of strcoll.  It definitely thinks "áa" should be first, which is bad.

Browsing around the sort source code, strcoll seems to be the heart
of the matter (by way of xmemcoll and memcoll) - please correct me if
I'm wrong.
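
The C test program itself is not shown in this message. As a rough stand-in that needs no compiler, the same two strings can be pushed through sort under different locales. This is only a sketch: the helper name first_in_locale is made up, sort applies a byte-order tie-break on top of the collation result, and the outcome for en_US.UTF-8 depends on whether that locale is installed and on its definition.

```shell
#!/bin/sh
# first_in_locale LOCALE A B: print whichever of A and B sort places first.
first_in_locale() {
    printf '%s\n%s\n' "$2" "$3" | LC_ALL=$1 sort | head -n 1
}

a_then_acute=$(printf 'a\303\241')   # "aá" = 0x61 0xC3 0xA1
acute_then_a=$(printf '\303\241a')   # "áa" = 0xC3 0xA1 0x61

# Byte order (C locale): 0x61 < 0xC3, so "aá" comes first.
c_first=$(first_in_locale C "$a_then_acute" "$acute_then_a")
echo "C locale puts first:     $c_first"

# Locale-aware order: the result is system-dependent, so none is claimed here.
utf8_first=$(first_in_locale en_US.UTF-8 "$a_then_acute" "$acute_then_a")
echo "en_US.UTF-8 puts first:  $utf8_first"
```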

  Wanderer, could you tell me what version of glibc you have?  Here's mine:
ii  libc6-dev  2.4-1ubuntu12  GNU C Library: Development Libraries and Hea

Thanks again for the help - I'll try and sort it out with the glibc
developers, or maybe by looking carefully at the recent Debian/Ubuntu
patches.

-Kevin


Re: feature request: gzip/bzip support for sort

2007-01-24 Thread Craig Macdonald
Firstly, I wanted to say that I am excited by the extremely fast progress
that has been made in sort for compressed temporary files.

Many thanks to Dan and others for the implementation.
(I've failed to accomplish the bootstrap of the CVS sources - are there
bootstrapped snapshots available anywhere?)

For 'sort' the tradeoff is different than for 'tar'.  We
don't particularly care if the format is stable, since it's
throwaway.  And we want fast compression, whereas people
generating tarballs often are willing to have way slower
compression for a slightly higher compression ratio.  (Plus,
new versions of 'tar' allow arbitrary compressors anyway.)

  

Now that we have the ability to fork decompression processes, are
we likely to see sort gain the ability to open gzipped (or bzip2ed) files?
For sorting a compressed stream, this is obviously not required, but
for merging, this would reduce a substantial mess of zcats to fifos, etc.
However, I'd understand if it was decided not to, because unlike the
temporary files, there is an existing workable solution.
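
For reference, a sketch of the existing workaround described above: merging two already-sorted gzipped files through named pipes. The file names and contents are invented, and gzip -dc is used in place of zcat.

```shell
#!/bin/sh
# Merge two sorted, gzip-compressed inputs without unpacking them to disk.
tmpdir=$(mktemp -d)
printf '1\n3\n5\n' | gzip -c > "$tmpdir/odd.gz"
printf '2\n4\n6\n' | gzip -c > "$tmpdir/even.gz"

# One fifo per input; each decompressor runs in the background and blocks
# until sort opens the other end of its pipe.
mkfifo "$tmpdir/odd.fifo" "$tmpdir/even.fifo"
gzip -dc "$tmpdir/odd.gz" > "$tmpdir/odd.fifo" &
gzip -dc "$tmpdir/even.gz" > "$tmpdir/even.fifo" &

# -m merges inputs that are already sorted instead of re-sorting them.
merged=$(sort -n -m "$tmpdir/odd.fifo" "$tmpdir/even.fifo")
wait
printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

The bookkeeping grows with the number of inputs (one fifo and one background process each), which is the mess the question refers to.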

Many thanks

Craig Macdonald

