Re: Human readable sort

2009-04-25 Thread Michael Speer
2009/4/25 Pádraig Brady :
>
> I've further modified your latest in the attached.
> I refactored the suffix finding a bit and also added
> support for --sort=human-numeric.

I refactored it again to handle some potential problems with how
separators and decimals points were handled.  It will still let you
write something silly like "1,3,4.5.6", but I've stopped scanning on
"4..4" or "3,,2" or even "5.M".  I'm not sure if that last one is used
meaningfully anywhere.  I did this partly to avoid breaking locales
where space is the separator.  `du --h --apparent-size` output like
this :

>> 4TO-DO
>> 5Million-dollar-idea
>> 3K  whatever

would have triggered the mixed prefix error spuriously due to the
greedy consumption of space in the second line.  I am not concerned
with making it parse intelligently for all the various locales, but
only to make sure it doesn't do anything particularly stupid.

http://en.wikipedia.org/wiki/ISO_31-0#Numbers

It appears ISO suggests the space for separator.  I poked around a bit
to see if any locales used space.  Apparently, the Hungarian locale
does.  I stopped looking there.

> I'm wondering whether "numeric" is superfluous?
> I.E. are --sort=human and --human-sort sufficient.
>

I started with just human, but thought it better to add the numeric
since sort is by default for strings, and both current switches that
enable numeric sorts have it in their name.  I would not fight a
reversion on this if no one thought it would look confusing or too
inconsistent to end users.

-Michael Speer
--- orig/coreutils-7.2/src/sort.c	2009-03-29 13:44:10.0 -0400
+++ coreutils-7.2/src/sort.c	2009-04-26 00:46:42.0 -0400
@@ -176,6 +176,8 @@
   bool random;			/* Sort by random hash of key.  */
   bool general_numeric;		/* Flag for general, numeric comparison.
    Handle numbers in exponential notation. */
+  bool human_numeric;   /* Flag for sorting by human readable
+   units with either SI xor IEC prefixes. */
   bool month;			/* Flag for comparison by month name. */
   bool reverse;			/* Reverse the sense of comparison. */
   bool version;			/* sort by version number */
@@ -336,6 +338,9 @@
   -i, --ignore-nonprintingconsider only printable characters\n\
   -M, --month-sortcompare (unknown) < `JAN' < ... < `DEC'\n\
 "), stdout);
+  fputs(_("\
+  -h, --human-numeric-sortcompare human readable numbers (e.g., 2K 1G)\n\
+"), stdout);
   fputs (_("\
   -n, --numeric-sort  compare according to string numerical value\n\
   -R, --random-sort   sort by random hash of keys\n\
@@ -344,8 +349,8 @@
 "), stdout);
   fputs (_("\
   --sort=WORD sort according to WORD:\n\
-general-numeric -g, month -M, numeric -n,\n\
-random -R, version -V\n\
+general-numeric -g, human-numeric -h, month -M,\n\
+numeric -n, random -R, version -V\n\
   -V, --version-sort  sort by numeric version\n\
 \n\
 "), stdout);
@@ -426,7 +431,7 @@
   SORT_OPTION
 };
 
-static char const short_options[] = "-bcCdfgik:mMno:rRsS:t:T:uVy:z";
+static char const short_options[] = "-bcCdfghik:mMno:rRsS:t:T:uVy:z";
 
 static struct option const long_options[] =
 {
@@ -442,6 +447,7 @@
   {"merge", no_argument, NULL, 'm'},
   {"month-sort", no_argument, NULL, 'M'},
   {"numeric-sort", no_argument, NULL, 'n'},
+  {"human-numeric-sort", no_argument, NULL, 'h'},
   {"version-sort", no_argument, NULL, 'V'},
   {"random-sort", no_argument, NULL, 'R'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
@@ -480,6 +486,7 @@
 
 #define SORT_TABLE \
   _st_("general-numeric", 'g') \
+  _st_("human-numeric",   'h') \
   _st_("month",   'M') \
   _st_("numeric", 'n') \
   _st_("random",  'R') \
@@ -1673,6 +1680,85 @@
   return strnumcmp (a, b, decimal_point, thousands_sep);
 }
 
+/* Exit with an error if a mixture of SI and IEC units detected.  */
+
+static void
+check_mixed_SI_IEC (char prefix)
+{
+  static int seen_si = -1;
+  bool si_present = prefix == 'i';
+  if (seen_si != -1 && seen_si != si_present)
+error (SORT_FAILURE, 0, _("both SI and IEC prefixes present on units"));
+  seen_si = si_present;
+}
+
+/* return an integer which represents the order of magnitude of 
+   the unit following the number
+*/
+unsigned int
+find_unit_order (const char* number)
+{
+  /* FIXME : if sort is fixed for multibyte 
+   *   separators this will need to be fixed too 
+   */
+  
+  static const char weights [UCHAR_LIM] = {
+['K']=1, ['M']=2, ['G']=3, ['T']=4, ['P']=5, ['E']=6, ['Z']=7, ['Y']=8,
+['k']=1,
+  };
+  
+  const char *p = number;
+  
+  /* scan to end of number
+   * decimals or separators not followed by digits
+   *   stop the scan
+   * numbers ending in decimals or separators are
+   *   are thus considered to be lacking in units
+   */
+

Re: Human readable sort

2009-04-25 Thread Pádraig Brady
Michael Speer wrote:
> That's much more readable.  I tacked in a size.

Good catch. The size is required or otherwise
one could get undefined results for some chars.

> The standards do not
> reference the lowercase letters you commented out, so I just deleted
> them outright.

Fair enough.

>> Something else to consider is to flag when
>> a mixture of SI and IEC units are used, as
>> this not being supported might not be obvious
>> to users and could cause difficult to debug issues for users.
>> I.E. flag an error if the following input is presented.
>>  999MB
>>  998MiB
>> I added a very quick hack for that to the patch for illustration.
>>
> 
> While du only outputs the first letter, this makes the change better
> for more general use.  I added a bounds check, but do not see anything
> else beyond your illustration would be needed.

Oops, yes the bounds check is also needed.

I've further modified your latest in the attached.
I refactored the suffix finding a bit and also added
support for --sort=human-numeric.
I'm wondering whether "numeric" is superfluous?
I.E. are --sort=human and --human-sort sufficient.

cheers,
Pádraig.
diff --git a/src/sort.c b/src/sort.c
index f48d727..9d7d659 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -176,6 +176,8 @@ struct keyfield
   bool random;			/* Sort by random hash of key.  */
   bool general_numeric;		/* Flag for general, numeric comparison.
    Handle numbers in exponential notation. */
+  bool human_numeric;   /* Flag for sorting by human readable
+   units with either SI xor IEC prefixes. */
   bool month;			/* Flag for comparison by month name. */
   bool reverse;			/* Reverse the sense of comparison. */
   bool version;			/* sort by version number */
@@ -336,6 +338,9 @@ Ordering options:\n\
   -i, --ignore-nonprintingconsider only printable characters\n\
   -M, --month-sortcompare (unknown) < `JAN' < ... < `DEC'\n\
 "), stdout);
+  fputs(_("\
+  -h, --human-numeric-sortcompare human readable numbers (e.g., 2K 1G)\n\
+"), stdout);
   fputs (_("\
   -n, --numeric-sort  compare according to string numerical value\n\
   -R, --random-sort   sort by random hash of keys\n\
@@ -344,8 +349,8 @@ Ordering options:\n\
 "), stdout);
   fputs (_("\
   --sort=WORD sort according to WORD:\n\
-general-numeric -g, month -M, numeric -n,\n\
-random -R, version -V\n\
+general-numeric -g, human-numeric -h, month -M,\n\
+numeric -n, random -R, version -V\n\
   -V, --version-sort  natural sort of (version) numbers within text\n\
 \n\
 "), stdout);
@@ -426,7 +431,7 @@ enum
   SORT_OPTION
 };
 
-static char const short_options[] = "-bcCdfgik:mMno:rRsS:t:T:uVy:z";
+static char const short_options[] = "-bcCdfghik:mMno:rRsS:t:T:uVy:z";
 
 static struct option const long_options[] =
 {
@@ -442,6 +447,7 @@ static struct option const long_options[] =
   {"merge", no_argument, NULL, 'm'},
   {"month-sort", no_argument, NULL, 'M'},
   {"numeric-sort", no_argument, NULL, 'n'},
+  {"human-numeric-sort", no_argument, NULL, 'h'},
   {"version-sort", no_argument, NULL, 'V'},
   {"random-sort", no_argument, NULL, 'R'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
@@ -480,6 +486,7 @@ static char const check_types[] =
 
 #define SORT_TABLE \
   _st_("general-numeric", 'g') \
+  _st_("human-numeric",   'h') \
   _st_("month",   'M') \
   _st_("numeric", 'n') \
   _st_("random",  'R') \
@@ -1673,6 +1680,60 @@ numcompare (const char *a, const char *b)
   return strnumcmp (a, b, decimal_point, thousands_sep);
 }
 
+/* Exit with an error if a mixture of SI and IEC units detected.  */
+
+static void
+check_mixed_SI_IEC (char prefix)
+{
+  static int seen_si = -1;
+  bool si_present = prefix == 'i';
+  if (seen_si != -1 && seen_si != si_present)
+error (SORT_FAILURE, 0, _("both SI and IEC prefixes present on units"));
+  seen_si = si_present;
+}
+
+/* Return the address of the number suffix or NUL if not present */
+
+static const char*
+find_suffix (const char* number)
+{
+  const char *p = number;
+
+  while (ISDIGIT (*p) || *p == decimal_point || *p == thousands_sep)
+p++;
+
+  if (*p)
+check_mixed_SI_IEC (*(p+1));
+
+  return p;
+}
+
+/* Compare numbers ending in units with SI xor IEC prefixes
+   < K < M < G < T < P < E < Z < Y
+   Assume that numbers are properly abbreviated.
+   i.e. input will never have 5000K instead of 5M.  */
+
+static int
+human_numcompare (const char *a, const char *b)
+{
+  static const char weights [UCHAR_LIM] = {
+['K']=1, ['M']=2, ['G']=3, ['T']=4, ['P']=5, ['E']=6, ['Z']=7, ['Y']=8,
+['k']=1,
+  };
+
+  while (blanks[to_uchar (*a)])
+a++;
+  while (blanks[to_uchar (*b)])
+b++;
+
+  int aw = weights[to_uchar (*find_suffix (a))];
+  int bw = weights[to_uch

Re: Human readable sort

2009-04-25 Thread Michael Speer
2009/4/24 Pádraig Brady :
> Michael Speer wrote:
>> I wrote the following patch to the 7.2 branch of coreutils to allow
>> `sort` to sort by human readable byte sizes.  I looked around a bit to
>> see what the status of previous attempts to integrate this
>> functionality were, but didn't see any very recent activity.  This is
>> my first interaction with coreutils, so if I missed something obvious,
>> please point me towards it.
>>
>> Is the last potential patch (
>> http://www.mail-archive.com/bug-coreutils@gnu.org/msg14080.html )
>> moving through?  If not, if I cleaned this up ( tabs, documentation,
>> and test cases ) and applied it to the current HEAD on savannah is
>> there a chance of getting this functionality into sort?
>
> Thanks for reviving this again.
> There was a more recent attempt that petered out unfortunately:
> http://www.mail-archive.com/bug-coreutils@gnu.org/msg14080.html
>
>>
>> Patch assumptions :
>>   * that numbers will use the best representation ( never uses 1024b
>> instead of 1k, etc )
>>   * that the sizes will be specified via suffixes of b, K, M, G, T, P,
>> E, Z, Y or their alternately cased variants
>>
>> The first assumption results in checking only the suffix when they differ.
>> This enables it to match the output of `du -h / du --si`, but possibly
>> not other tools that do not conform to these assumptions.
>
> The consensus was that these assumptions are appropriate and useful.
>
> We assume C99 support now for coreutils so I tweaked your patch,
> the main change being to greatly shrink the lookup table initialisation.
> Note I commented out the lower case letters (except 'k') as I don't
> think any coreutils generate those and they could preclude supporting
> other suffixes in future. I'm not sure about doing that but I think it's
> better to err on the side of too few suffixes than too many?
>

That's much more readable.  I tacked in a size.  The standards do not
reference the lowercase letters you commented out, so I just deleted
them outright.

> Something else to consider is to flag when
> a mixture of SI and IEC units are used, as
> this not being supported might not be obvious
> to users and could cause difficult to debug issues for users.
> I.E. flag an error if the following input is presented.
>  999MB
>  998MiB
> I added a very quick hack for that to the patch for illustration.
>

While du only outputs the first letter, this makes the change better
for more general use.  I added a bounds check, but do not see anything
else beyond your illustration would be needed.

> I also noticed that you didn't terminate the fields before
> processing as was done for the other numeric sorts?
> So I changed that also in the attached patch but didn't
> analyze it TBH.
>

Your change was entirely appropriate.  I should have done that originally.

>
> p.s. obviously docs and help and tests need to be written,
> but we can do that after we get the implementation done.
>

I've attached the updated diff.

Thanks for taking an interest in this.

Michael Speer
--- orig/coreutils-7.2/src/sort.c	2009-03-29 13:44:10.0 -0400
+++ coreutils-7.2/src/sort.c	2009-04-25 04:46:06.0 -0400
@@ -176,6 +176,8 @@
   bool random;			/* Sort by random hash of key.  */
   bool general_numeric;		/* Flag for general, numeric comparison.
    Handle numbers in exponential notation. */
+  bool human_numeric;   /* Flag for sorting by human readable 
+   units with either SI or IEC prefixes */
   bool month;			/* Flag for comparison by month name. */
   bool reverse;			/* Reverse the sense of comparison. */
   bool version;			/* sort by version number */
@@ -336,6 +338,10 @@
   -i, --ignore-nonprintingconsider only printable characters\n\
   -M, --month-sortcompare (unknown) < `JAN' < ... < `DEC'\n\
 "), stdout);
+  fputs(_("\
+  -h, --human-numeric-sortcompare string numerical values ending in units\n\
+  prefixed with either SI xor IEC prefixes\n\
+"), stdout);
   fputs (_("\
   -n, --numeric-sort  compare according to string numerical value\n\
   -R, --random-sort   sort by random hash of keys\n\
@@ -426,7 +432,7 @@
   SORT_OPTION
 };
 
-static char const short_options[] = "-bcCdfgik:mMno:rRsS:t:T:uVy:z";
+static char const short_options[] = "-bcCdfghik:mMno:rRsS:t:T:uVy:z";
 
 static struct option const long_options[] =
 {
@@ -442,6 +448,7 @@
   {"merge", no_argument, NULL, 'm'},
   {"month-sort", no_argument, NULL, 'M'},
   {"numeric-sort", no_argument, NULL, 'n'},
+  {"human-numeric-sort", no_argument, NULL, 'h'},
   {"version-sort", no_argument, NULL, 'V'},
   {"random-sort", no_argument, NULL, 'R'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
@@ -1673,6 +1680,57 @@
   return strnumcmp (a, b, decimal_point, thousands_sep);
 }
 
+/* error if a mixture of SI and IEC units used.  */
+static void
+check_mixed_SI_IEC (char prefix)
+{
+  static int s

Re: [PATCH] build: use automake's new $(AM_V_GEN) and $(AM_V_at) variables

2009-04-25 Thread Ralf Wildenhues
Hi Jim,

* Jim Meyering wrote on Sat, Apr 25, 2009 at 08:04:36AM CEST:
> Ralf Wildenhues wrote:
> > Jim Meyering writes:
> >> +  $(AM_V_GEN)
> >> +  $(AM_V_at)rm -f $@ $...@-t

> > Thanks for using these.  Note the $(AM_V_GEN) will expand to the empty 
> > string
> > with V=1 or --disable-silent-rules.  I am actually not sure whether any make
> > implementation fails hard upon empty rule commands, but a couple of them 
> > will
> > output a warning, such as FreeBSD:

> How about making AM_V_GEN expand to ":;" rather than the empty string?
> That would seem slightly more symmetric.
> Fewer pitfalls for users, too.

Yes, I'm still considering that as an alternative.  There are two minor
draw-backs of using ":;" over the empty string:

- the verbose output from 'make' will contain the ":;", and users might
  wonder about that; on a more aesthetic note (but still quite able to
  turn a pretty bikeshed color into an ugly one), this makes the verbose
  output from packages using 'silent-rules' differ ever so slightly more
  from the output from packages not using the 'silent-rules' option in
  the first place.

- GNU make has this minor optimization that, when a rule doesn't contain
  any characters special to the shell, then it can go ahead and exec the
  command directly, rather than going through another fork&exec with
sh -c "command ..."

  The ":;" would prevent this optimization.

Thanks!
Ralf


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils