Re: sort(1) updates

Jason McIntyre Fri, 11 Jul 2014 04:04:37 -0700

On Fri, Jul 11, 2014 at 12:25:01AM -0400, Jared Yanovich wrote:
> On Sun, Jul 06, 2014 at 09:03:17PM +0200, Otto Moerbeek wrote:
> 
> > > Alternatively we could just import the FreeBSD sort(1) rewrite from 2012.
> > 
> > Did you try to
> > port it? I won't have time the coming weeks, I'll be on vacation. 
> 
> Have fun!  When you get back, some notes:
> 
>  - there is a lot of fluff that I would guess is there simply for GNU sort(1)
>    compatibility, like -M (month sort) and -V (version number sort).
> 
>  - this version retains the parallel support (pthreads)
> 
>  - I retained the original -R (record separator) support instead of -R
>    for random
> 
> Some of the tests in our regress appear to be wrong (specifically the -b tests
> but also a few others).  Other than that, this new sort is faster against a 
> few
> quick workloads I whipped up.  Completes system 'make build' on amd64.
>


i do not think that we should just slap in freebsd's page willy nilly
(which i presume is what's happening here). i know that makes things
nice and easy for you, but i want to see a diff that just documents any
changes to current behaviour.

then we can look at updating bits of it where you think the freebsd text
is better than ours.

jmc

> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> retrieving revision 1.40
> diff -u -p -r1.40 sort.1
> --- sort.1    24 Aug 2013 22:18:05 -0000      1.40
> +++ sort.1    11 Jul 2014 04:07:07 -0000
> @@ -1,4 +1,5 @@
> -.\"  $OpenBSD: sort.1,v 1.40 2013/08/24 22:18:05 jmc Exp $
> +.\"  $OpenBSD: sort.1,v 1.31 2007/08/21 21:22:37 millert Exp $
> +.\"  $FreeBSD: head/usr.bin/sort/sort.1.in 264918 2014-04-25 15:27:19Z 
> bdrewery $
>  .\"
>  .\" Copyright (c) 1991, 1993
>  .\"  The Regents of the University of California.  All rights reserved.
> @@ -32,44 +33,46 @@
>  .\"
>  .\"     @(#)sort.1   8.1 (Berkeley) 6/6/93
>  .\"
> -.Dd $Mdocdate: August 24 2013 $
> +.Dd $Mdocdate$
>  .Dt SORT 1
>  .Os
>  .Sh NAME
>  .Nm sort
> -.Nd sort, merge, or sequence check text files
> +.Nd sort, merge, or sequence check text and binary files
>  .Sh SYNOPSIS
>  .Nm sort
> -.Op Fl bCcdfHimnrsuz
> +.Op Fl bCcdfghiMmnRrsuVz
>  .Sm off
>  .Op Fl k\ \& Ar field1 Op , Ar field2
>  .Sm on
>  .Op Fl o Ar output
> -.Op Fl R Ar char
> -.Bk -words
> +.Op Fl R Ar record-separator
> +.Op Fl S Ar memsize
>  .Op Fl T Ar dir
> -.Ek
>  .Op Fl t Ar char
> -.Op Ar
> +.Op Ar file ...
>  .Sh DESCRIPTION
>  The
>  .Nm
> -utility sorts text files by lines,
> -operating in one of three modes: sort, merge, or check.
> -In sort mode, the specified files are combined and sorted
> -by line.
> -Merge mode is the same as sort mode except that the input
> -files are assumed to be pre-sorted.
> -In check mode, a single input file is checked to ensure that
> -it is correctly sorted.
> -.Pp
> -Comparisons are based on one or more sort keys extracted
> -from each line of input, and are performed lexicographically.
> +utility sorts text and binary files by lines.
> +A line is a record separated from the subsequent record by a
> +newline (default) or NUL
> +.Sq \e0
> +character
> +.Po Fl z
> +option
> +.Pc .
> +A record can contain any printable or unprintable characters.
> +Comparisons are based on one or more sort keys extracted from
> +each line of input, and are performed lexicographically,
> +according to the current locale's collating rules and the
> +specified command-line options that can tune the actual
> +sorting behavior.
>  By default, if keys are not given,
>  .Nm
> -regards each input line as a single field.
> +uses entire lines for comparison.
>  .Pp
> -The options are as follows:
> +The command line options are as follows:
>  .Bl -tag -width Ds
>  .It Fl C
>  Check that the single input file is sorted.
> @@ -82,123 +85,159 @@ but additionally write a message to
>  .Em stderr
>  if the input file is not sorted.
>  .It Fl m
> -Merge only; the input files are assumed to be pre-sorted.
> -This option is overridden by the
> -.Fl C
> -or
> -.Fl c
> -options,
> -if they are also present.
> +Merge only.
> +The input files are assumed to be pre-sorted.
> +If they are not sorted the output order is undefined.
>  .It Fl o Ar output
> -The argument given is the name of an
> +Print the output to the
>  .Ar output
> -file to be used instead of the standard output.
> -This file can be the same as one of the input files.
> -.It Fl T Ar dir
> +file instead of the standard output.
> +.It Fl S Ar size
>  Use
> -.Ar dir
> -as the directory for temporary files.
> -The default is the contents of the environment variable
> +.Ar size
> +for the maximum size of the memory buffer.
> +Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used.
> +If a memory limit is not explicitly specified,
> +.Nm
> +takes up to about 90% of available memory.
> +If the file size is too big to fit into the memory buffer,
> +the temporary disk files are used to perform the sorting.
> +.It Fl T Ar dir
> +Store temporary files in the directory
> +.Ar dir .
> +The default path is the value of the environment variable
>  .Ev TMPDIR
>  or
>  .Pa /var/tmp
>  if
>  .Ev TMPDIR
> -does not exist.
> +is not defined.
>  .It Fl u
> -Unique: suppress all but one in each set of lines having equal keys.
> -If used with the
> -.Fl C
> -or
> +Unique keys.
> +Suppress all lines that have a key that is equal to an already
> +processed one.
> +This option, similarly to
> +.Fl s ,
> +implies a stable sort.
> +If used with
>  .Fl c
> -options, also check that there are no lines with duplicate keys.
> -.El
> -.Pp
> -The following options override the default ordering rules globally:
> -.Bl -tag -width indent
> -.It Fl H
> -Use a merge sort instead of a radix sort.
> -This option should be used for files larger than 60MB.
> -.It Fl s
> -Enable stable sort.
> -Uses additional resources (see
> -.Xr sradixsort 3 ) .
> +or
> +.Fl C ,
> +.Nm
> +also checks that there are no lines with duplicate keys.
>  .El
>  .Pp
>  The following options override the default ordering rules.
> -If ordering options appear before the first
> -.Fl k
> -option, they apply globally to all sort keys.
> +When ordering options appear independently of key field
> +specifications, they apply globally to all sort keys.
>  When attached to a specific key (see
>  .Fl k ) ,
> -the ordering options override
> -all global ordering options for that key.
> -Note that the ordering options intended to apply globally should not
> -appear after
> -.Fl k
> -or results may be unexpected.
> +the ordering options override all global ordering options for
> +the key they are attached to.
>  .Bl -tag -width indent
> +.It Fl b
> +Ignore leading blank characters when comparing lines.
>  .It Fl d
> -Only blank space and alphanumeric characters
> -.\" according
> -.\" to the current setting of LC_CTYPE
> -are used in making comparisons.
> +Consider only blank spaces and alphanumeric characters in comparisons.
>  .It Fl f
> -Considers all lowercase characters that have uppercase
> -equivalents to be the same for purposes of comparison.
> +Convert all lowercase characters to their uppercase equivalent
> +before comparison, that is, perform case-independent sorting
> +.Pq Dq case folding .
> +.It Fl g
> +Sort by general numerical value.
> +As opposed to
> +.Fl n ,
> +this option handles general floating points, which have a much
> +permissive format than those allowed by
> +. Fl n ,
> +but it has a significant performance drawback.
> +.It Fl H
> +Use
> +.Xr mergesort 3 .
> +This is a universal algorithm that can always be used with a guarantee
> +on the worst case upper bound on performance but it is not always the
> +fastest.
> +.It Fl h
> +Sort by numerical value, but take into account the SI suffix,
> +if present.
> +Sort first by numeric sign (negative, zero, or
> +positive); then by SI suffix (either empty, or `k' or `K', or one
> +of `MGTPEZY', in that order); and finally by numeric value.
> +The SI suffix must immediately follow the number.
> +For example, '12345K' sorts before '1M', because M is "larger" than K.
>  .It Fl i
>  Ignore all non-printable characters.
> +.It Fl M
> +Sort by month abbreviations.
> +Unknown strings are considered smaller than the month names.
>  .It Fl n
> -An initial numeric string, consisting of optional blank space, optional
> -minus sign, and zero or more digits (including decimal point)
> -.\" with
> -.\" optional radix character and thousands
> -.\" separator
> -.\" (as defined in the current locale),
> -is sorted by arithmetic value.
> -(The
> -.Fl n
> -option no longer implies the
> -.Fl b
> -option.)
> +Sort fields numerically by arithmetic value.
> +Fields are supposed to have optional blanks in the beginning, an
> +optional minus sign, zero or more digits (including decimal point and
> +possible thousand separators).
>  .It Fl r
> -Reverse the sense of comparisons.
> +Sort in reverse order.
> +.It Fl s
> +Stable sort.
> +This option maintains the original record order of records that have
> +and equal key.
> +This may use additional memory.
> +.It Fl V
> +Sort version numbers.
> +The input lines are treated as file names in form
> +.Ar PREFIX VERSION SUFFIX ,
> +where
> +.Ar SUFFIX
> +matches the regular expression
> +"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
> +The files are compared by their prefixes and versions (leading
> +zeros are ignored in version numbers, see example below).
> +If an input string does not match the pattern, then it is compared
> +using the byte compare function.
> +All string comparisons are performed in C locale, the locale
> +environment setting is ignored.
> +.Pp
> +Example:
> +.Bd -literal -offset 5n
> +$ ls sort* | sort -V
> +sort-1.022.tgz
> +sort-1.23.tgz
> +sort-1.23.1.tgz
> +sort-1.024.tgz
> +sort-1.024.003.
> +sort-1.024.003.tgz
> +sort-1.024.07.tgz
> +sort-1.024.009.tgz
> +.Ed
>  .El
>  .Pp
>  The treatment of field separators can be altered using these options:
>  .Bl -tag -width indent
>  .It Fl b
> -Ignores leading blank space when determining the start
> -and end of a restricted sort key.
> -A
> +Ignore leading blank space when determining the start
> +and end of a restricted sort key
> +.Pq see Fl k .
> +If
>  .Fl b
> -option specified before the first
> +is specified before the first
>  .Fl k
> -option applies globally to all
> -.Fl k
> -options.
> -Otherwise, the
> +option, it applies globally to all key specifications.
> +Otherwise,
>  .Fl b
> -option can be attached independently to each
> +can be attached independently to each
>  .Ar field
> -argument of the
> -.Fl k
> -option (see below).
> -Note that
> -.Fl b
> -should not appear after
> -.Fl k ,
> -and that it has no effect unless key fields are specified.
> -.It Fl R Ar char
> -.Ar char
> -is used as the record separator character.
> +argument of the key specifications.
> +.It Fl R Ar str
> +.Ar str
> +is used as the record separator.
>  This should be used with discretion;
>  .Fl R Aq Ar alphanumeric
>  usually produces undesirable results.
>  The default record separator is newline.
>  .It Fl t Ar char
> +Use
>  .Ar char
> -is used as the field separator character.
> +as the field separator character.
>  The initial
>  .Ar char
>  is not considered to be part of a field when determining key offsets.
> @@ -210,13 +249,21 @@ delimits an empty field).
>  If
>  .Fl t
>  is not specified, the default field separator is a sequence of
> -blank-space characters, and consecutive blank spaces do
> +blank space characters, and consecutive blank spaces do
>  .Em not
> -delimit an empty field; further, the initial blank space
> +delimit an empty field, however, the initial blank space
>  .Em is
>  considered part of a field when determining key offsets.
> +To use NUL as field separator, use
> +.Fl t
> +.Sq \e0 .
>  .It Fl z
> -Uses the nul character as the record separator.
> +Use NUL as record separator.
> +By default, records in the files are supposed to be separated by
> +the newline characters.
> +With this option, NUL
> +.Pq Sq \e0
> +is used as a record separator character.
>  .El
>  .Pp
>  Sort keys are specified with:
> @@ -226,21 +273,102 @@ Sort keys are specified with:
>  .Fl k\ \& Ar field1 Op , Ar field2
>  .Sm on
>  .Xc
> -Designates the starting position,
> +Define a restricted sort key that has the starting position
>  .Ar field1 ,
> -and optional ending position,
> -.Ar field2 ,
> +and optional ending position
> +.Ar field2
>  of a key field.
>  The
>  .Fl k
>  option may be specified multiple times,
> -in which case subsequent keys are compared after earlier keys compare equal.
> -The
> -.Fl k
> -option replaces the obsolescent options
> -.Cm \(pl Ns Ar pos1
> +in which case subsequent keys are compared when earlier keys compare equal.
> +.El
> +.Pp
> +Other options:
> +.Bl -tag -width indent
> +.It Fl Fl batch-size Ns = Ns Ar num
> +Specify maximum number of files that can be opened by
> +.Nm
> +at once.
> +This option affects behavior when having many input files or using
> +temporary files.
> +The default value is 16.
> +.It Fl Fl compress-program Ns = Ns Ar program
> +Use
> +.Ar program
> +to compress temporary files.
> +.Ar program
> +must compress standard input to standard output, when called
> +without arguments.
> +When called with argument
> +.Fl d
> +it must decompress standard input to standard output.
> +If
> +.Ar program
> +fails,
> +.Nm
> +exits with error.
> +An example of
> +.Ar program
> +that can be used here is bzip2.
> +.It Fl Fl files0-from Ns = Ns Ar filename
> +Take the input file list from the file
> +.Ar filename.
> +The file names must be separated by NUL characters
> +(e.g. from the command
> +.Cm find ... -print0 ) .
> +.It Fl Fl heapsort
> +Try to use heap sort, if the sort specifications allow.
> +This sort algorithm cannot be used with
> +.Fl u
> +and
> +.Fl s .
> +.It Fl Fl mmap
> +Use
> +.Xr mmap 2
> +instead of
> +.Xr stdio 3
> +for file I/O, which may improve performance in some cases.
> +.It Fl Fl parallel Ns = Ns Ar N
> +Set the maximum number of execution threads.
> +Defaults to the number of CPUs available.
> +.It Fl Fl qsort
> +Try to use quick sort, if the sort specifications allow.
> +This sort algorithm cannot be used with
> +.Fl u
>  and
> -.Fl Ns Ar pos2 .
> +.Fl s .
> +.It Fl Fl radixsort
> +Try to use radix sort, if the sort specifications allow.
> +The radix sort can only be used for trivial locales (C and POSIX),
> +and it cannot be used for numeric or month sort.
> +Radix sort is very fast and uses no additional memory when a stable sort
> +is not required.
> +Note:
> +.Fl R
> +is ignored with radix sort.
> +.It Fl Fl random-sort
> +Sort by a random order.
> +This is a random permutation of the inputs except that
> +the equal keys sort together.
> +It is implemented by hashing the input keys and sorting
> +the hash values.
> +The hash function is chosen randomly.
> +The hash function is randomized by
> +.Pa /dev/random
> +content, or by file content if it is specified by
> +.Fl Fl random-source .
> +Even if multiple sort fields are specified,
> +the same random hash function is used for all of them.
> +.It Fl Fl random-source Ns = Ns Ar filename
> +In random sort, the file content is used as the source of the 'seed' data
> +for the hash function choice.
> +Two invocations of random sort with the same seed data will use
> +the same hash function and will produce the same result if the input is
> +also identical.
> +By default, file
> +.Pa /dev/random
> +is used.
>  .El
>  .Pp
>  The following operands are available:
> @@ -257,8 +385,7 @@ the standard input is used.
>  .El
>  .Pp
>  A field is defined as a maximal sequence of characters other than the
> -field separator and record separator
> -.Pq newline by default .
> +field separator and record separator (newline by default).
>  Initial blank spaces are included in the field unless
>  .Fl b
>  has been specified;
> @@ -266,17 +393,17 @@ the first blank space of a sequence of b
>  separator and is included in the field (unless
>  .Fl t
>  is specified).
> -For example, by default all blank spaces at the beginning of a line are
> +For example, all blank spaces at the beginning of a line are
>  considered to be part of the first field.
>  .Pp
>  Fields are specified by the
>  .Sm off
>  .Fl k\ \& Ar field1 Op , Ar field2
>  .Sm on
> -argument.
> -A missing
> +command-line option.
> +If
>  .Ar field2
> -argument defaults to the end of a line.
> +is missing, the end of the key defaults to the end of the line.
>  .Pp
>  The arguments
>  .Ar field1
> @@ -285,12 +412,25 @@ and
>  have the form
>  .Em m.n
>  .Em (m,n > 0)
> -and can be followed by one or more of the letters
> +and can be followed by one or more of the modifiers
>  .Cm b , d , f , i ,
> -.Cm n ,
> +.Cm n , g , M
>  and
>  .Cm r ,
>  which correspond to the options discussed above.
> +When
> +.Cm b
> +is specified it applies only to
> +.Ar field1
> +or
> +.Ar field2
> +where it is specified while the rest of the modifiers
> +apply to the whole key field regardless if they are
> +specified only with
> +.Ar field1
> +or
> +.Ar field2
> +or both.
>  A
>  .Ar field1
>  position specified by
> @@ -327,13 +467,18 @@ if
>  .Em n
>  is greater than the length of the line, the field is taken to be empty.
>  .Pp
> +.Em n Ns th
> +positions are always counted from the field beginning, even if the field
> +is shorter than the number of specified positions.
> +Thus, the key can really start from a position in a subsequent field.
> +.Pp
>  A
>  .Ar field2
>  position specified by
>  .Em m.n
>  is interpreted as the
>  .Em n Ns th
> -character (including separators) of the
> +character (including separators) from the beginning of the
>  .Em m Ns th
>  field.
>  A missing
> @@ -346,7 +491,7 @@ field;
>  designates the end of a line.
>  Thus the option
>  .Fl k Ar v.x,w.y
> -is synonymous with the obsolescent option
> +is synonymous with the obsolete option
>  .Cm \(pl Ns Ar v-\&1.x-\&1
>  .Fl Ns Ar w-\&1.y ;
>  when
> @@ -356,7 +501,7 @@ is omitted,
>  is synonymous with
>  .Cm \(pl Ns Ar v-\&1.x-\&1
>  .Fl Ns Ar w\&.0 .
> -The obsolescent
> +The obsolete
>  .Cm \(pl Ns Ar pos1
>  .Fl Ns Ar pos2
>  option is still supported, except for
> @@ -365,9 +510,34 @@ which has no
>  .Fl k
>  equivalent.
>  .Sh ENVIRONMENT
> -.Bl -tag -width Fl
> +.Bl -tag -width Ev
> +.It Ev LANG
> +Used as a last resort to determine different kinds of locale-specific
> +behavior if neither the respective environment variable, nor
> +.Ev LC_ALL
> +are set.
> +.It Ev LC_ALL
> +Locale settings that override all of the above locale settings.
> +This environment variable can be used to set all these settings
> +to the same value at once.
> +.It Ev LC_COLLATE
> +Locale settings to be used to determine the collation for
> +sorting records.
> +.It Ev LC_CTYPE
> +Locale settings to be used to case conversion and classification
> +of characters, that is, which characters are considered
> +whitespaces, etc.
> +.It Ev LC_MESSAGES
> +Locale settings that determine the language of output messages
> +that
> +.Nm
> +prints out.
> +.It Ev LC_NUMERIC
> +Locale settings that determine the number format used in numeric sort.
> +.It Ev LC_TIME
> +Locale settings that determine the month format used in month sort.
>  .It Ev TMPDIR
> -Path in which to store temporary files.
> +Path to the directory in which temporary files will be stored.
>  Note that
>  .Ev TMPDIR
>  may be overridden by the
> @@ -376,41 +546,36 @@ option.
>  .El
>  .Sh FILES
>  .Bl -tag -width Pa -compact
> -.It Pa /var/tmp/sort.*
> -default temporary directories
> -.It Pa output Ns #PID
> -temporary name for
> -.Ar output
> -if
> -.Ar output
> -already exists
> +.It Pa /var/tmp/.bsdsort.PID.*
> +Temporary files.
> +.It Pa /dev/random
> +Default seed file for the random sort.
>  .El
>  .Sh EXIT STATUS
>  The
>  .Nm
> -utility exits with one of the following values:
> +utility shall exit with one of the following values:
>  .Pp
> -.Bl -tag -width Ds -offset indent -compact
> +.Bl -tag -width flag -compact
>  .It 0
> -Normal behavior.
> -.It 1
> -The input file is not sorted and
> -.Fl C
> +Successfully sorted the input files or if used with
> +.Fl c
>  or
> +.Fl C ,
> +the input file already met the sorting criteria.
> +.It 1
> +On disorder (or non-uniqueness) with the
>  .Fl c
> -was given, or there are duplicate keys and
> -.Fl Cu
>  or
> -.Fl cu
> -was given.
> +.Fl C
> +options.
>  .It 2
>  An error occurred.
>  .El
>  .Sh SEE ALSO
>  .Xr comm 1 ,
>  .Xr join 1 ,
> -.Xr uniq 1 ,
> -.Xr radixsort 3
> +.Xr uniq 1
>  .Sh STANDARDS
>  The
>  .Nm
> @@ -419,46 +584,49 @@ utility is compliant with the
>  specification.
>  .Pp
>  The flags
> -.Op Fl HRsTz
> -are extensions to that specification.
> +.Op Fl ghMRSsTVz
> +are extensions to the POSIX specification.
> +.Pp
> +All long options are extensions to the specification;
> +some of them are provided for compatibility with GNU
> +.Nm .
> +.Pp
> +The old key notations
> +.Cm \(pl Ns Ar pos1
> +and
> +.Fl Ns Ar pos2
> +come from older versions of
> +.Nm
> +and are still supported but their use is highly discouraged.
>  .Sh HISTORY
>  A
>  .Nm
> -command appeared in
> +command first appeared in
>  .At v3 .
> +.Sh AUTHORS
> +.An Gabor Kovesdan Aq Mt [email protected]
> +.br
> +.An Oleg Moskalenko Aq Mt [email protected]
>  .Sh NOTES
> +This implementation of
>  .Nm
>  has no limits on input line length (other than imposed by available
>  memory) or any restrictions on bytes allowed within lines.
>  .Pp
> -To protect data
> -.Nm
> -.Fl o
> -calls
> -.Xr link 2
> -and
> -.Xr unlink 2 ,
> -and thus fails on protected directories.
> +The performance depends highly on locale settings,
> +efficient choice of sort keys and key complexity.
> +The fastest sort is with locale C, on whole lines,
> +with option
> +.Fl s.
> +In general, locale C is the fastest, then single-byte
> +locales follow and multi-byte locales as the slowest but
> +the correct collation order is always respected.
> +As for the key specification, the simpler to process the
> +lines the faster the search will be.
>  .Pp
> -The current sort command uses lexicographic radix sorting, which requires
> -that sort keys be kept in memory (as opposed to previous versions which
> -used quick and merge sorts and did not).
> -Thus performance depends highly on efficient choice of sort keys, and the
> -.Fl b
> -option and the
> -.Ar field2
> -argument of the
> -.Fl k
> -option should be used whenever possible.
> -Similarly,
> -.Nm
> -.Fl k1f
> -is equivalent to
> -.Nm
> -.Fl f
> -and may take twice as long.
> -.Sh BUGS
> -To sort files larger than 60MB, use
> -.Nm
> -.Fl H ;
> -files larger than 704MB must be sorted in smaller pieces, then merged.
> +When sorting by arithmetic value, using
> +.Fl n
> +results in much better performance than
> +.Fl g
> +so its use is encouraged
> +whenever possible.

Re: sort(1) updates

Reply via email to