On Fri, Jul 11, 2014 at 12:25:01AM -0400, Jared Yanovich wrote: > On Sun, Jul 06, 2014 at 09:03:17PM +0200, Otto Moerbeek wrote: > > > > Alternatively we could just import the FreeBSD sort(1) rewrite from 2012. > > > > Did you try to > > port it? I won't have time the coming weeks, I'll be on vacation. > > Have fun! When you get back, some notes: > > - there is a lot of fluff that I would guess is there simply for GNU sort(1) > compatibility, like -M (month sort) and -V (version number sort). > > - this version retains the parallel support (pthreads) > > - I retained the original -R (record separator) support instead of -R > for random > > Some of the tests in our regress appear to be wrong (specifically the -b tests > but also a few others). Other than that, this new sort is faster against a > few > quick workloads I whipped up. Completes system 'make build' on amd64. >
i do not think that we should just slap in freebsd's page willy nilly (which i presume is what's happening here). i know that makes things nice and easy for you, but i want to see a diff that just documents any changes to current behaviour. then we can look at updating bits of it where you think the freebsd text is better than ours. jmc > Index: sort.1 > =================================================================== > RCS file: /cvs/src/usr.bin/sort/sort.1,v > retrieving revision 1.40 > diff -u -p -r1.40 sort.1 > --- sort.1 24 Aug 2013 22:18:05 -0000 1.40 > +++ sort.1 11 Jul 2014 04:07:07 -0000 > @@ -1,4 +1,5 @@ > -.\" $OpenBSD: sort.1,v 1.40 2013/08/24 22:18:05 jmc Exp $ > +.\" $OpenBSD: sort.1,v 1.31 2007/08/21 21:22:37 millert Exp $ > +.\" $FreeBSD: head/usr.bin/sort/sort.1.in 264918 2014-04-25 15:27:19Z > bdrewery $ > .\" > .\" Copyright (c) 1991, 1993 > .\" The Regents of the University of California. All rights reserved. > @@ -32,44 +33,46 @@ > .\" > .\" @(#)sort.1 8.1 (Berkeley) 6/6/93 > .\" > -.Dd $Mdocdate: August 24 2013 $ > +.Dd $Mdocdate$ > .Dt SORT 1 > .Os > .Sh NAME > .Nm sort > -.Nd sort, merge, or sequence check text files > +.Nd sort, merge, or sequence check text and binary files > .Sh SYNOPSIS > .Nm sort > -.Op Fl bCcdfHimnrsuz > +.Op Fl bCcdfghiMmnRrsuVz > .Sm off > .Op Fl k\ \& Ar field1 Op , Ar field2 > .Sm on > .Op Fl o Ar output > -.Op Fl R Ar char > -.Bk -words > +.Op Fl R Ar record-separator > +.Op Fl S Ar memsize > .Op Fl T Ar dir > -.Ek > .Op Fl t Ar char > -.Op Ar > +.Op Ar file ... > .Sh DESCRIPTION > The > .Nm > -utility sorts text files by lines, > -operating in one of three modes: sort, merge, or check. > -In sort mode, the specified files are combined and sorted > -by line. > -Merge mode is the same as sort mode except that the input > -files are assumed to be pre-sorted. > -In check mode, a single input file is checked to ensure that > -it is correctly sorted. > -.Pp > -Comparisons are based on one or more sort keys extracted > -from each line of input, and are performed lexicographically. > +utility sorts text and binary files by lines. > +A line is a record separated from the subsequent record by a > +newline (default) or NUL > +.Sq \e0 > +character > +.Po Fl z > +option > +.Pc . > +A record can contain any printable or unprintable characters. > +Comparisons are based on one or more sort keys extracted from > +each line of input, and are performed lexicographically, > +according to the current locale's collating rules and the > +specified command-line options that can tune the actual > +sorting behavior. > By default, if keys are not given, > .Nm > -regards each input line as a single field. > +uses entire lines for comparison. > .Pp > -The options are as follows: > +The command line options are as follows: > .Bl -tag -width Ds > .It Fl C > Check that the single input file is sorted. > @@ -82,123 +85,159 @@ but additionally write a message to > .Em stderr > if the input file is not sorted. > .It Fl m > -Merge only; the input files are assumed to be pre-sorted. > -This option is overridden by the > -.Fl C > -or > -.Fl c > -options, > -if they are also present. > +Merge only. > +The input files are assumed to be pre-sorted. > +If they are not sorted the output order is undefined. > .It Fl o Ar output > -The argument given is the name of an > +Print the output to the > .Ar output > -file to be used instead of the standard output. > -This file can be the same as one of the input files. > -.It Fl T Ar dir > +file instead of the standard output. > +.It Fl S Ar size > Use > -.Ar dir > -as the directory for temporary files. > -The default is the contents of the environment variable > +.Ar size > +for the maximum size of the memory buffer. > +Size modifiers %,b,K,M,G,T,P,E,Z,Y can be used. > +If a memory limit is not explicitly specified, > +.Nm > +takes up to about 90% of available memory. > +If the file size is too big to fit into the memory buffer, > +the temporary disk files are used to perform the sorting. > +.It Fl T Ar dir > +Store temporary files in the directory > +.Ar dir . > +The default path is the value of the environment variable > .Ev TMPDIR > or > .Pa /var/tmp > if > .Ev TMPDIR > -does not exist. > +is not defined. > .It Fl u > -Unique: suppress all but one in each set of lines having equal keys. > -If used with the > -.Fl C > -or > +Unique keys. > +Suppress all lines that have a key that is equal to an already > +processed one. > +This option, similarly to > +.Fl s , > +implies a stable sort. > +If used with > .Fl c > -options, also check that there are no lines with duplicate keys. > -.El > -.Pp > -The following options override the default ordering rules globally: > -.Bl -tag -width indent > -.It Fl H > -Use a merge sort instead of a radix sort. > -This option should be used for files larger than 60MB. > -.It Fl s > -Enable stable sort. > -Uses additional resources (see > -.Xr sradixsort 3 ) . > +or > +.Fl C , > +.Nm > +also checks that there are no lines with duplicate keys. > .El > .Pp > The following options override the default ordering rules. > -If ordering options appear before the first > -.Fl k > -option, they apply globally to all sort keys. > +When ordering options appear independently of key field > +specifications, they apply globally to all sort keys. > When attached to a specific key (see > .Fl k ) , > -the ordering options override > -all global ordering options for that key. > -Note that the ordering options intended to apply globally should not > -appear after > -.Fl k > -or results may be unexpected. > +the ordering options override all global ordering options for > +the key they are attached to. > .Bl -tag -width indent > +.It Fl b > +Ignore leading blank characters when comparing lines. > .It Fl d > -Only blank space and alphanumeric characters > -.\" according > -.\" to the current setting of LC_CTYPE > -are used in making comparisons. > +Consider only blank spaces and alphanumeric characters in comparisons. > .It Fl f > -Considers all lowercase characters that have uppercase > -equivalents to be the same for purposes of comparison. > +Convert all lowercase characters to their uppercase equivalent > +before comparison, that is, perform case-independent sorting > +.Pq Dq case folding . > +.It Fl g > +Sort by general numerical value. > +As opposed to > +.Fl n , > +this option handles general floating points, which have a much > +permissive format than those allowed by > +. Fl n , > +but it has a significant performance drawback. > +.It Fl H > +Use > +.Xr mergesort 3 . > +This is a universal algorithm that can always be used with a guarantee > +on the worst case upper bound on performance but it is not always the > +fastest. > +.It Fl h > +Sort by numerical value, but take into account the SI suffix, > +if present. > +Sort first by numeric sign (negative, zero, or > +positive); then by SI suffix (either empty, or `k' or `K', or one > +of `MGTPEZY', in that order); and finally by numeric value. > +The SI suffix must immediately follow the number. > +For example, '12345K' sorts before '1M', because M is "larger" than K. > .It Fl i > Ignore all non-printable characters. > +.It Fl M > +Sort by month abbreviations. > +Unknown strings are considered smaller than the month names. > .It Fl n > -An initial numeric string, consisting of optional blank space, optional > -minus sign, and zero or more digits (including decimal point) > -.\" with > -.\" optional radix character and thousands > -.\" separator > -.\" (as defined in the current locale), > -is sorted by arithmetic value. > -(The > -.Fl n > -option no longer implies the > -.Fl b > -option.) > +Sort fields numerically by arithmetic value. > +Fields are supposed to have optional blanks in the beginning, an > +optional minus sign, zero or more digits (including decimal point and > +possible thousand separators). > .It Fl r > -Reverse the sense of comparisons. > +Sort in reverse order. > +.It Fl s > +Stable sort. > +This option maintains the original record order of records that have > +and equal key. > +This may use additional memory. > +.It Fl V > +Sort version numbers. > +The input lines are treated as file names in form > +.Ar PREFIX VERSION SUFFIX , > +where > +.Ar SUFFIX > +matches the regular expression > +"(\.([A-Za-z~][A-Za-z0-9~]*)?)*". > +The files are compared by their prefixes and versions (leading > +zeros are ignored in version numbers, see example below). > +If an input string does not match the pattern, then it is compared > +using the byte compare function. > +All string comparisons are performed in C locale, the locale > +environment setting is ignored. > +.Pp > +Example: > +.Bd -literal -offset 5n > +$ ls sort* | sort -V > +sort-1.022.tgz > +sort-1.23.tgz > +sort-1.23.1.tgz > +sort-1.024.tgz > +sort-1.024.003. > +sort-1.024.003.tgz > +sort-1.024.07.tgz > +sort-1.024.009.tgz > +.Ed > .El > .Pp > The treatment of field separators can be altered using these options: > .Bl -tag -width indent > .It Fl b > -Ignores leading blank space when determining the start > -and end of a restricted sort key. > -A > +Ignore leading blank space when determining the start > +and end of a restricted sort key > +.Pq see Fl k . > +If > .Fl b > -option specified before the first > +is specified before the first > .Fl k > -option applies globally to all > -.Fl k > -options. > -Otherwise, the > +option, it applies globally to all key specifications. > +Otherwise, > .Fl b > -option can be attached independently to each > +can be attached independently to each > .Ar field > -argument of the > -.Fl k > -option (see below). > -Note that > -.Fl b > -should not appear after > -.Fl k , > -and that it has no effect unless key fields are specified. > -.It Fl R Ar char > -.Ar char > -is used as the record separator character. > +argument of the key specifications. > +.It Fl R Ar str > +.Ar str > +is used as the record separator. > This should be used with discretion; > .Fl R Aq Ar alphanumeric > usually produces undesirable results. > The default record separator is newline. > .It Fl t Ar char > +Use > .Ar char > -is used as the field separator character. > +as the field separator character. > The initial > .Ar char > is not considered to be part of a field when determining key offsets. > @@ -210,13 +249,21 @@ delimits an empty field). > If > .Fl t > is not specified, the default field separator is a sequence of > -blank-space characters, and consecutive blank spaces do > +blank space characters, and consecutive blank spaces do > .Em not > -delimit an empty field; further, the initial blank space > +delimit an empty field, however, the initial blank space > .Em is > considered part of a field when determining key offsets. > +To use NUL as field separator, use > +.Fl t > +.Sq \e0 . > .It Fl z > -Uses the nul character as the record separator. > +Use NUL as record separator. > +By default, records in the files are supposed to be separated by > +the newline characters. > +With this option, NUL > +.Pq Sq \e0 > +is used as a record separator character. > .El > .Pp > Sort keys are specified with: > @@ -226,21 +273,102 @@ Sort keys are specified with: > .Fl k\ \& Ar field1 Op , Ar field2 > .Sm on > .Xc > -Designates the starting position, > +Define a restricted sort key that has the starting position > .Ar field1 , > -and optional ending position, > -.Ar field2 , > +and optional ending position > +.Ar field2 > of a key field. > The > .Fl k > option may be specified multiple times, > -in which case subsequent keys are compared after earlier keys compare equal. > -The > -.Fl k > -option replaces the obsolescent options > -.Cm \(pl Ns Ar pos1 > +in which case subsequent keys are compared when earlier keys compare equal. > +.El > +.Pp > +Other options: > +.Bl -tag -width indent > +.It Fl Fl batch-size Ns = Ns Ar num > +Specify maximum number of files that can be opened by > +.Nm > +at once. > +This option affects behavior when having many input files or using > +temporary files. > +The default value is 16. > +.It Fl Fl compress-program Ns = Ns Ar program > +Use > +.Ar program > +to compress temporary files. > +.Ar program > +must compress standard input to standard output, when called > +without arguments. > +When called with argument > +.Fl d > +it must decompress standard input to standard output. > +If > +.Ar program > +fails, > +.Nm > +exits with error. > +An example of > +.Ar program > +that can be used here is bzip2. > +.It Fl Fl files0-from Ns = Ns Ar filename > +Take the input file list from the file > +.Ar filename. > +The file names must be separated by NUL characters > +(e.g. from the command > +.Cm find ... -print0 ) . > +.It Fl Fl heapsort > +Try to use heap sort, if the sort specifications allow. > +This sort algorithm cannot be used with > +.Fl u > +and > +.Fl s . > +.It Fl Fl mmap > +Use > +.Xr mmap 2 > +instead of > +.Xr stdio 3 > +for file I/O, which may improve performance in some cases. > +.It Fl Fl parallel Ns = Ns Ar N > +Set the maximum number of execution threads. > +Defaults to the number of CPUs available. > +.It Fl Fl qsort > +Try to use quick sort, if the sort specifications allow. > +This sort algorithm cannot be used with > +.Fl u > and > -.Fl Ns Ar pos2 . > +.Fl s . > +.It Fl Fl radixsort > +Try to use radix sort, if the sort specifications allow. > +The radix sort can only be used for trivial locales (C and POSIX), > +and it cannot be used for numeric or month sort. > +Radix sort is very fast and uses no additional memory when a stable sort > +is not required. > +Note: > +.Fl R > +is ignored with radix sort. > +.It Fl Fl random-sort > +Sort by a random order. > +This is a random permutation of the inputs except that > +the equal keys sort together. > +It is implemented by hashing the input keys and sorting > +the hash values. > +The hash function is chosen randomly. > +The hash function is randomized by > +.Pa /dev/random > +content, or by file content if it is specified by > +.Fl Fl random-source . > +Even if multiple sort fields are specified, > +the same random hash function is used for all of them. > +.It Fl Fl random-source Ns = Ns Ar filename > +In random sort, the file content is used as the source of the 'seed' data > +for the hash function choice. > +Two invocations of random sort with the same seed data will use > +the same hash function and will produce the same result if the input is > +also identical. > +By default, file > +.Pa /dev/random > +is used. > .El > .Pp > The following operands are available: > @@ -257,8 +385,7 @@ the standard input is used. > .El > .Pp > A field is defined as a maximal sequence of characters other than the > -field separator and record separator > -.Pq newline by default . > +field separator and record separator (newline by default). > Initial blank spaces are included in the field unless > .Fl b > has been specified; > @@ -266,17 +393,17 @@ the first blank space of a sequence of b > separator and is included in the field (unless > .Fl t > is specified). > -For example, by default all blank spaces at the beginning of a line are > +For example, all blank spaces at the beginning of a line are > considered to be part of the first field. > .Pp > Fields are specified by the > .Sm off > .Fl k\ \& Ar field1 Op , Ar field2 > .Sm on > -argument. > -A missing > +command-line option. > +If > .Ar field2 > -argument defaults to the end of a line. > +is missing, the end of the key defaults to the end of the line. > .Pp > The arguments > .Ar field1 > @@ -285,12 +412,25 @@ and > have the form > .Em m.n > .Em (m,n > 0) > -and can be followed by one or more of the letters > +and can be followed by one or more of the modifiers > .Cm b , d , f , i , > -.Cm n , > +.Cm n , g , M > and > .Cm r , > which correspond to the options discussed above. > +When > +.Cm b > +is specified it applies only to > +.Ar field1 > +or > +.Ar field2 > +where it is specified while the rest of the modifiers > +apply to the whole key field regardless if they are > +specified only with > +.Ar field1 > +or > +.Ar field2 > +or both. > A > .Ar field1 > position specified by > @@ -327,13 +467,18 @@ if > .Em n > is greater than the length of the line, the field is taken to be empty. > .Pp > +.Em n Ns th > +positions are always counted from the field beginning, even if the field > +is shorter than the number of specified positions. > +Thus, the key can really start from a position in a subsequent field. > +.Pp > A > .Ar field2 > position specified by > .Em m.n > is interpreted as the > .Em n Ns th > -character (including separators) of the > +character (including separators) from the beginning of the > .Em m Ns th > field. > A missing > @@ -346,7 +491,7 @@ field; > designates the end of a line. > Thus the option > .Fl k Ar v.x,w.y > -is synonymous with the obsolescent option > +is synonymous with the obsolete option > .Cm \(pl Ns Ar v-\&1.x-\&1 > .Fl Ns Ar w-\&1.y ; > when > @@ -356,7 +501,7 @@ is omitted, > is synonymous with > .Cm \(pl Ns Ar v-\&1.x-\&1 > .Fl Ns Ar w\&.0 . > -The obsolescent > +The obsolete > .Cm \(pl Ns Ar pos1 > .Fl Ns Ar pos2 > option is still supported, except for > @@ -365,9 +510,34 @@ which has no > .Fl k > equivalent. > .Sh ENVIRONMENT > -.Bl -tag -width Fl > +.Bl -tag -width Ev > +.It Ev LANG > +Used as a last resort to determine different kinds of locale-specific > +behavior if neither the respective environment variable, nor > +.Ev LC_ALL > +are set. > +.It Ev LC_ALL > +Locale settings that override all of the above locale settings. > +This environment variable can be used to set all these settings > +to the same value at once. > +.It Ev LC_COLLATE > +Locale settings to be used to determine the collation for > +sorting records. > +.It Ev LC_CTYPE > +Locale settings to be used to case conversion and classification > +of characters, that is, which characters are considered > +whitespaces, etc. > +.It Ev LC_MESSAGES > +Locale settings that determine the language of output messages > +that > +.Nm > +prints out. > +.It Ev LC_NUMERIC > +Locale settings that determine the number format used in numeric sort. > +.It Ev LC_TIME > +Locale settings that determine the month format used in month sort. > .It Ev TMPDIR > -Path in which to store temporary files. > +Path to the directory in which temporary files will be stored. > Note that > .Ev TMPDIR > may be overridden by the > @@ -376,41 +546,36 @@ option. > .El > .Sh FILES > .Bl -tag -width Pa -compact > -.It Pa /var/tmp/sort.* > -default temporary directories > -.It Pa output Ns #PID > -temporary name for > -.Ar output > -if > -.Ar output > -already exists > +.It Pa /var/tmp/.bsdsort.PID.* > +Temporary files. > +.It Pa /dev/random > +Default seed file for the random sort. > .El > .Sh EXIT STATUS > The > .Nm > -utility exits with one of the following values: > +utility shall exit with one of the following values: > .Pp > -.Bl -tag -width Ds -offset indent -compact > +.Bl -tag -width flag -compact > .It 0 > -Normal behavior. > -.It 1 > -The input file is not sorted and > -.Fl C > +Successfully sorted the input files or if used with > +.Fl c > or > +.Fl C , > +the input file already met the sorting criteria. > +.It 1 > +On disorder (or non-uniqueness) with the > .Fl c > -was given, or there are duplicate keys and > -.Fl Cu > or > -.Fl cu > -was given. > +.Fl C > +options. > .It 2 > An error occurred. > .El > .Sh SEE ALSO > .Xr comm 1 , > .Xr join 1 , > -.Xr uniq 1 , > -.Xr radixsort 3 > +.Xr uniq 1 > .Sh STANDARDS > The > .Nm > @@ -419,46 +584,49 @@ utility is compliant with the > specification. > .Pp > The flags > -.Op Fl HRsTz > -are extensions to that specification. > +.Op Fl ghMRSsTVz > +are extensions to the POSIX specification. > +.Pp > +All long options are extensions to the specification; > +some of them are provided for compatibility with GNU > +.Nm . > +.Pp > +The old key notations > +.Cm \(pl Ns Ar pos1 > +and > +.Fl Ns Ar pos2 > +come from older versions of > +.Nm > +and are still supported but their use is highly discouraged. > .Sh HISTORY > A > .Nm > -command appeared in > +command first appeared in > .At v3 . > +.Sh AUTHORS > +.An Gabor Kovesdan Aq Mt [email protected] > +.br > +.An Oleg Moskalenko Aq Mt [email protected] > .Sh NOTES > +This implementation of > .Nm > has no limits on input line length (other than imposed by available > memory) or any restrictions on bytes allowed within lines. > .Pp > -To protect data > -.Nm > -.Fl o > -calls > -.Xr link 2 > -and > -.Xr unlink 2 , > -and thus fails on protected directories. > +The performance depends highly on locale settings, > +efficient choice of sort keys and key complexity. > +The fastest sort is with locale C, on whole lines, > +with option > +.Fl s. > +In general, locale C is the fastest, then single-byte > +locales follow and multi-byte locales as the slowest but > +the correct collation order is always respected. > +As for the key specification, the simpler to process the > +lines the faster the search will be. > .Pp > -The current sort command uses lexicographic radix sorting, which requires > -that sort keys be kept in memory (as opposed to previous versions which > -used quick and merge sorts and did not). > -Thus performance depends highly on efficient choice of sort keys, and the > -.Fl b > -option and the > -.Ar field2 > -argument of the > -.Fl k > -option should be used whenever possible. > -Similarly, > -.Nm > -.Fl k1f > -is equivalent to > -.Nm > -.Fl f > -and may take twice as long. > -.Sh BUGS > -To sort files larger than 60MB, use > -.Nm > -.Fl H ; > -files larger than 704MB must be sorted in smaller pieces, then merged. > +When sorting by arithmetic value, using > +.Fl n > +results in much better performance than > +.Fl g > +so its use is encouraged > +whenever possible.
