On 2025/03/12 14:54, Mark Kettenis wrote:
> > From: Janne Johansson <[email protected]>
> > Date: Wed, 12 Mar 2025 14:44:37 +0100
> >
> > Den ons 12 mars 2025 kl 14:29 skrev Van Dung Ha <[email protected]>:
> > > I have been an OpenBSD user for a long time and really appreciate all
> > > the effort of the community that has contributed to this magnificent
> > > distribution. Recently, I was parsing a few IP addresses from the snort
> > > logs to populate a pf table and encountered something counter-intuitive.
> > > Here is an example source list.
> > >
> > > 89.234.156.205
> > > 151.101.38.172
> > > 104.109.143.150
> > > 104.109.143.150
> > > 77.224.14.2
> > > 77.224.14.21
> > > 104.97.14.224
> > > 77.224.14.18
> > > 77.224.14.21
> > > 2.21.34.170
> > > 199.232.210.172
> > > 2.18.121.27
> > > 91.216.110.53
> > > 34.89.91.10
> > >
> > > When you sort this list using '| sort -u', you will end up with the
> > > following, expected list.
> > >
> > > 104.109.143.150
> > > 104.97.14.224
> > > 151.101.38.172
> > > 199.232.210.172
> > > 2.18.121.27
> > > 2.21.34.170
> > > 34.89.91.10
> > > 77.224.14.18
> > > 77.224.14.2
> > > 77.224.14.21
> > > 89.234.156.205
> > > 91.216.110.53
> > >
> > > The weird thing is that when you filter the same source list using
> > > '| sort -hu', you end up with this shorter list:
> > >
> > > 2.18.121.27
> > > 2.21.34.170
> > > 34.89.91.10
> > > 77.224.14.18
> > > 89.234.156.205
> > > 91.216.110.53
> > > 104.109.143.150
> > > 104.97.14.224
> > > 151.101.38.172
> > > 199.232.210.172
> > >
> > > Notice that this list is missing 77.224.14.2 and 77.224.14.21! Is this
> > > by design? My 'human' interpretation is that the missing items are still
> > > unique in the list and should be part of the result list.
> >
> > Same goes for "sort -nu". Seems both -n and -h make it act weird.
> >
> > Running with --debug on shows it has slightly unexpected ideas:
> > [...]
> > ; k1=<77.224.14.2>, k2=<77.224.14.21>; s1=<77.224.14.2>, s2=<77.224.14.21>; cmp1=0
> > ; k1=<77.224.14.2>, k2=<77.224.14.18>; s1=<77.224.14.2>, s2=<77.224.14.18>; cmp1=0
> > ; k1=<77.224.14.2>, k2=<77.224.14.21>; s1=<77.224.14.2>, s2=<77.224.14.21>; cmp1=0
> > ; k1=<77.224.14.2>, k2=<89.234.156.205>; s1=<77.224.14.2>, s2=<89.234.156.205>; cmp1=-1
> > [...]
> >
> > Kind of hard to see those 77.224 comparisons as equal unless it stops
> > at one of the dots.
> > Doesn't seem to stop at first dot though,
> >
> > ; k1=<2.21.34.170>, k2=<2.18.121.27>; s1=<2.21.34.170>, s2=<2.18.121.27>; cmp1=1
>
> Well, that makes some sort of sense if you interpret the strings as
> floating point numbers and ignore everything after as garbage.

GNU's implementation of sort behaves exactly the same with -h and -n;
their manual describes -u as "output only the first of an equal run".

POSIX says "suppress all but one in each set of lines having equal
keys", and its definition of -n fits into that:

    Restrict the sort key to an initial numeric string, consisting
    of optional <blank> characters, optional <hyphen-minus> character,
    and zero or more digits with an optional radix character and
    thousands separators (as defined in the current locale), which
    shall be sorted by arithmetic value. An empty digit string shall
    be treated as zero. Leading zeros and signs on zeros shall not
    affect ordering.

https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sort.html
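
To make the "initial numeric string" rule concrete, here is a rough
sketch that uses awk rather than sort itself. awk converts a string to
a number by reading only its leading numeric prefix, which approximates
the key that -n ends up comparing. It is not how sort is implemented,
and numeric_key is a made-up helper name used only for illustration.

```shell
# numeric_key is a hypothetical helper, not part of sort(1).
# For each input line it prints the line and the number awk derives
# from its leading numeric prefix (everything after the second dot
# is ignored by the conversion).
numeric_key() {
    awk '{ printf "%s -> %s\n", $0, $0 + 0 }'
}

printf '%s\n' 77.224.14.2 77.224.14.18 77.224.14.21 | numeric_key
```

All three addresses should map to 77.224, so as numeric keys they are
equal and -u keeps only one of them.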

I think our docs could be improved, but the -n behaviour seems valid and,
importantly, matches the other common implementation and does not seem
to violate POSIX.

-h is of course an extension, but matching -n seems right.
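
For the original goal of de-duplicating IPv4 addresses in numeric
order, one common approach is to split on the dots and give sort one
numeric key per octet, so that -u only drops lines where all four keys
are equal. A minimal sketch, assuming plain dotted-quad input with one
address per line; sort_ips is a made-up name:

```shell
# sort_ips is a hypothetical helper name.
# -t . splits each line into fields at the dots, and each -k N,Nn key
# limits the numeric comparison to a single octet. With -u, a line is
# suppressed only when all four keys compare equal to another line's.
sort_ips() {
    sort -u -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n
}

printf '%s\n' 77.224.14.21 77.224.14.2 77.224.14.18 77.224.14.21 | sort_ips
```

With this, 77.224.14.2, 77.224.14.18 and 77.224.14.21 should all
survive, and only the repeated 77.224.14.21 is removed.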