Implement "uniq -z", like "sort -z"

James Youngman Sat, 12 May 2007 07:41:33 -0700

This patch (against current coreutils anon-CVS) implements a -z option
for uniq.

It behaves like sort's -z option and has the same long form,
--zero-terminated: uniq reads and writes NUL-terminated lines instead
of newline-terminated ones.  The implementation relies on the
linebuffer module providing readlinebuffer_delim; I submitted an
implementation of that function to bug-gnulib this morning.
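
For anyone who wants to see the intended behaviour without applying the
gnulib change, here is a rough standalone sketch in C.  It is not the
code from the patch: it assumes POSIX getdelim as a stand-in for
readlinebuffer_delim, ignores -c, -d, -u and the field/character
skipping options, and uniq_delim is a name made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper, not part of uniq.c.  Copy records from IN to
   OUT, dropping any record equal to the one just before it.  Records
   end with DELIM ('\n' or '\0'); a final record that lacks the
   delimiter still gets one on output.  Return 0 on success, -1 on a
   read or write error.  */
static int
uniq_delim (FILE *in, FILE *out, int delim)
{
  char *cur = NULL, *prev = NULL;
  size_t cur_cap = 0, prev_cap = 0;
  ssize_t cur_len, prev_len = -1;   /* -1: no previous record yet */
  int status = 0;

  assert (in != NULL && out != NULL);

  while ((cur_len = getdelim (&cur, &cur_cap, delim, in)) > 0)
    {
      /* Drop the delimiter so "a" and "a<delim>" compare equal.  */
      if (cur[cur_len - 1] == delim)
        cur_len--;

      if (prev_len != cur_len || memcmp (prev, cur, cur_len) != 0)
        {
          if (fwrite (cur, 1, cur_len, out) != (size_t) cur_len
              || putc (delim, out) == EOF)
            {
              status = -1;
              break;
            }
        }

      /* Swap the buffers so the current record becomes the previous.  */
      char *tmp_buf = prev;
      size_t tmp_cap = prev_cap;
      prev = cur;
      prev_cap = cur_cap;
      cur = tmp_buf;
      cur_cap = tmp_cap;
      prev_len = cur_len;
    }

  if (ferror (in))
    status = -1;
  free (cur);
  free (prev);
  return status;
}
```

Calling uniq_delim (stdin, stdout, '\0') approximates "uniq -z", and
passing '\n' approximates plain "uniq".  The only thing that changes
between the two is the delimiter argument, which is also the shape of
the change to check_file in the patch.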

To avoid potential problems with whitespace changes, I have attached
the patch to this email rather than pasting it in.  The text of the
relevant ChangeLog entry is also included inline below.

Thanks,
James.



2007-05-12  James Youngman  <[EMAIL PROTECTED]>

        Add -z option to uniq.  This was originally proposed by
        Egmont Koblinger.
        * NEWS: Mention that uniq has gained a new option,
        --zero-terminated (-z).
        * src/uniq.c (longopts, usage, check_file, main): Add new option
        --zero-terminated (-z).  This makes uniq consume and produce
        NUL-terminated lines rather than newline-terminated lines (the
        default).  Pass the delimiter as a function argument from
        main into check_file.
        * doc/coreutils.texi (uniq invocation): Describe the new option
        --zero-terminated (-z).
        * tests/uniq/Test.pm (@tv): Add a number of new tests for the uniq
        option -z (and its synonym, --zero-terminated).
        * tests/uniq/Makefile.am (run_gen, maint_gen): Add the new test
        files generated by the updated Test.pm file.

Index: NEWS
===================================================================
RCS file: /sources/coreutils/coreutils/NEWS,v
retrieving revision 1.492
diff -u -p -r1.492 NEWS
--- NEWS	8 May 2007 14:03:32 -0000	1.492
+++ NEWS	12 May 2007 14:27:50 -0000
@@ -6,6 +6,10 @@ GNU coreutils NEWS                      
 
   Add SELinux support (FIXME: add details here)
 
+  uniq accepts a new option: --zero-terminated (-z).  As with the sort
+  option of the same name, this makes uniq consume and produce
+  NUL-terminated lines rather than newline-terminated lines.
+
 ** Bug fixes
 
   ls -x DIR would sometimes output the wrong string in place of the
Index: doc/coreutils.texi
===================================================================
RCS file: /sources/coreutils/coreutils/doc/coreutils.texi,v
retrieving revision 1.380
diff -u -p -r1.380 coreutils.texi
--- doc/coreutils.texi	3 May 2007 11:52:47 -0000	1.380
+++ doc/coreutils.texi	12 May 2007 14:27:54 -0000
@@ -4261,6 +4261,19 @@ Compare at most @var{n} characters on ea
 fields and characters).  By default the entire rest of the lines are
 compared.
 
+@item -z
+@itemx --zero-terminated
+@opindex -z
+@opindex --zero-terminated
+@cindex sort zero-terminated lines
+Treat the input as a set of lines, each terminated by a null character
+(@acronym{ASCII} @sc{nul}) instead of a line feed
+(@acronym{ASCII} @sc{lf}).
+This option can be useful in conjunction with @samp{sort -z}, @samp{perl -0} or
+@samp{find -print0} and @samp{xargs -0} which do the same in order to
+reliably handle arbitrary file names (even those containing blanks
+or other special characters).
+
 @end table
 
 @exitstatus
Index: src/uniq.c
===================================================================
RCS file: /sources/coreutils/coreutils/src/uniq.c,v
retrieving revision 1.130
diff -u -p -r1.130 uniq.c
--- src/uniq.c	28 Mar 2007 06:57:40 -0000	1.130
+++ src/uniq.c	12 May 2007 14:27:54 -0000
@@ -1,5 +1,5 @@
 /* uniq -- remove duplicate lines from a sorted file
-   Copyright (C) 86, 91, 1995-2006 Free Software Foundation, Inc.
+   Copyright (C) 86, 91, 1995-2007 Free Software Foundation, Inc.
 
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -119,6 +119,7 @@ static struct option const longopts[] =
   {"skip-fields", required_argument, NULL, 'f'},
   {"skip-chars", required_argument, NULL, 's'},
   {"check-chars", required_argument, NULL, 'w'},
+  {"zero-terminated", no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -156,6 +157,7 @@ Mandatory arguments to long options are 
   -i, --ignore-case     ignore differences in case when comparing\n\
   -s, --skip-chars=N    avoid comparing the first N characters\n\
   -u, --unique          only print unique lines\n\
+  -z, --zero-terminated end lines with 0 byte, not newline\n\
 "), stdout);
      fputs (_("\
   -w, --check-chars=N   compare no more than N characters in lines\n\
@@ -268,7 +270,7 @@ writeline (struct linebuffer const *line
    If either is "-", use the standard I/O stream for it instead. */
 
 static void
-check_file (const char *infile, const char *outfile)
+check_file (const char *infile, const char *outfile, char delimiter)
 {
   struct linebuffer lb1, lb2;
   struct linebuffer *thisline, *prevline;
@@ -300,7 +302,7 @@ check_file (const char *infile, const ch
 	{
 	  char *thisfield;
 	  size_t thislen;
-	  if (readlinebuffer (thisline, stdin) == 0)
+	  if (readlinebuffer_delim (thisline, stdin, delimiter) == 0)
 	    break;
 	  thisfield = find_field (thisline);
 	  thislen = thisline->length - 1 - (thisfield - thisline->buffer);
@@ -323,7 +325,7 @@ check_file (const char *infile, const ch
       uintmax_t match_count = 0;
       bool first_delimiter = true;
 
-      if (readlinebuffer (prevline, stdin) == 0)
+      if (readlinebuffer_delim (prevline, stdin, delimiter) == 0)
 	goto closefiles;
       prevfield = find_field (prevline);
       prevlen = prevline->length - 1 - (prevfield - prevline->buffer);
@@ -333,7 +335,7 @@ check_file (const char *infile, const ch
 	  bool match;
 	  char *thisfield;
 	  size_t thislen;
-	  if (readlinebuffer (thisline, stdin) == 0)
+	  if (readlinebuffer_delim (thisline, stdin, delimiter) == 0)
 	    {
 	      if (ferror (stdin))
 		goto closefiles;
@@ -406,7 +408,8 @@ main (int argc, char **argv)
   enum Skip_field_option_type skip_field_option_type = SFO_NONE;
   int nfiles = 0;
   char const *file[2];
-
+  char delimiter = '\n';	/* change with --zero-terminated, -z */
+  
   file[0] = file[1] = "-";
   initialize_main (&argc, &argv);
   program_name = argv[0];
@@ -434,7 +437,7 @@ main (int argc, char **argv)
       if (optc == -1
 	  || (posixly_correct && nfiles != 0)
 	  || ((optc = getopt_long (argc, argv,
-				   "-0123456789Dcdf:is:uw:", longopts, NULL))
+				   "-0123456789Dcdf:is:uw:z", longopts, NULL))
 	      == -1))
 	{
 	  if (argc <= optind)
@@ -530,6 +533,10 @@ main (int argc, char **argv)
 				  N_("invalid number of bytes to compare"));
 	  break;
 
+	case 'z':
+	  delimiter = 0;
+	  break;
+
 	case_GETOPT_HELP_CHAR;
 
 	case_GETOPT_VERSION_CHAR (PROGRAM_NAME, AUTHORS);
@@ -546,7 +553,7 @@ main (int argc, char **argv)
       usage (EXIT_FAILURE);
     }
 
-  check_file (file[0], file[1]);
+  check_file (file[0], file[1], delimiter);
 
   exit (EXIT_SUCCESS);
 }
Index: tests/uniq/Makefile.am
===================================================================
RCS file: /sources/coreutils/coreutils/tests/uniq/Makefile.am,v
retrieving revision 1.19
diff -u -p -r1.19 Makefile.am
--- tests/uniq/Makefile.am	15 Jan 2007 10:33:49 -0000	1.19
+++ tests/uniq/Makefile.am	12 May 2007 14:27:54 -0000
@@ -21,26 +21,32 @@
 ##test-files-begin
 x = uniq
 explicit =
-maint_gen = 1.I 1.X 2.I 2.X 3.I 3.X 4.I 4.X 5.I 5.X 6.I 6.X 7.I 7.X 8.I 8.X \
-9.I 9.X 10.I 10.X 11.I 11.X 12.I 12.X 13.I 13.X 20.I 20.X 21.I 21.X 22.I 22.X \
-23.I 23.X obs30.I obs30.X 31.I 31.X 32.I 32.X 33.I 33.X 34.I 34.X 35.I 35.X \
-obs-plus40.I obs-plus40.X obs-plus41.I obs-plus41.X 42.I 42.X 43.I 43.X \
-obs-plus44.I obs-plus44.X obs-plus45.I obs-plus45.X 50.I 50.X 51.I 51.X 52.I \
-52.X 53.I 53.X 54.I 54.X 55.I 55.X 56.I 56.X 57.I 57.X 60.I 60.X 61.I 61.X \
-62.I 62.X 63.I 63.X 64.I 64.X 65.I 65.X 90.I 90.X 91.I 91.X 92.I 92.X 93.I \
-93.X 94.I 94.X 101.I 101.X 102.I 102.X 110.I 110.X 111.I 111.X 112.I 112.X \
-113.I 113.X 114.I 114.X 115.I 115.X 116.I 116.X 117.I 117.X 118.I 118.X 119.I \
-119.X 120.I 120.X 121.I 121.X
-run_gen = 1.O 1.E 2.O 2.E 3.O 3.E 4.O 4.E 5.O 5.E 6.O 6.E 7.O 7.E 8.O 8.E 9.O \
-9.E 10.O 10.E 11.O 11.E 12.O 12.E 13.O 13.E 20.O 20.E 21.O 21.E 22.O 22.E \
-23.O 23.E obs30.O obs30.E 31.O 31.E 32.O 32.E 33.O 33.E 34.O 34.E 35.O 35.E \
-obs-plus40.O obs-plus40.E obs-plus41.O obs-plus41.E 42.O 42.E 43.O 43.E \
-obs-plus44.O obs-plus44.E obs-plus45.O obs-plus45.E 50.O 50.E 51.O 51.E 52.O \
-52.E 53.O 53.E 54.O 54.E 55.O 55.E 56.O 56.E 57.O 57.E 60.O 60.E 61.O 61.E \
-62.O 62.E 63.O 63.E 64.O 64.E 65.O 65.E 90.O 90.E 91.O 91.E 92.O 92.E 93.O \
-93.E 94.O 94.E 101.O 101.E 102.O 102.E 110.O 110.E 111.O 111.E 112.O 112.E \
-113.O 113.E 114.O 114.E 115.O 115.E 116.O 116.E 117.O 117.E 118.O 118.E 119.O \
-119.E 120.O 120.E 121.O 121.E
+maint_gen = 1.I 1.X 2.I 2.X 2z.I 2z.X 2z2.I 2z2.X 3.I 3.X 3z.I 3z.X 3z2.I \
+3z2.X 4.I 4.X 4z.I 4z.X 4z2.I 4z2.X 5.I 5.X 5z.I 5z.X 5z2.I 5z2.X 6.I 6.X \
+6z2.I 6z2.X 7.I 7.X 8.I 8.X 8z.I 8z.X 9.I 9.X 9z.I 9z.X 10.I 10.X 10z.I 10z.X \
+11.I 11.X 11z.I 11z.X 12.I 12.X 13.I 13.X 20.I 20.X 20z.I 20z.X 20z2.I 20z2.X \
+21.I 21.X 22.I 22.X 23.I 23.X 23z.I 23z.X obs30.I obs30.X 31.I 31.X 32.I 32.X \
+33.I 33.X 34.I 34.X 35.I 35.X 35z.I 35z.X obs-plus40.I obs-plus40.X \
+obs-plus41.I obs-plus41.X 42.I 42.X 43.I 43.X obs-plus44.I obs-plus44.X \
+obs-plus45.I obs-plus45.X 50.I 50.X 51.I 51.X 52.I 52.X 53.I 53.X 54.I 54.X \
+55.I 55.X 56.I 56.X 57.I 57.X 60.I 60.X 60z.I 60z.X 61.I 61.X 62.I 62.X 63.I \
+63.X 64.I 64.X 65.I 65.X 90.I 90.X 91.I 91.X 92.I 92.X 93.I 93.X 94.I 94.X \
+101.I 101.X 102.I 102.X 110.I 110.X 111.I 111.X 112.I 112.X 113.I 113.X 114.I \
+114.X 115.I 115.X 116.I 116.X 117.I 117.X 118.I 118.X 119.I 119.X 120.I 120.X \
+121.I 121.X 122.I 122.X 123.I 123.X
+run_gen = 1.O 1.E 2.O 2.E 2z.O 2z.E 2z2.O 2z2.E 3.O 3.E 3z.O 3z.E 3z2.O 3z2.E \
+4.O 4.E 4z.O 4z.E 4z2.O 4z2.E 5.O 5.E 5z.O 5z.E 5z2.O 5z2.E 6.O 6.E 6z2.O \
+6z2.E 7.O 7.E 8.O 8.E 8z.O 8z.E 9.O 9.E 9z.O 9z.E 10.O 10.E 10z.O 10z.E 11.O \
+11.E 11z.O 11z.E 12.O 12.E 13.O 13.E 20.O 20.E 20z.O 20z.E 20z2.O 20z2.E 21.O \
+21.E 22.O 22.E 23.O 23.E 23z.O 23z.E obs30.O obs30.E 31.O 31.E 32.O 32.E 33.O \
+33.E 34.O 34.E 35.O 35.E 35z.O 35z.E obs-plus40.O obs-plus40.E obs-plus41.O \
+obs-plus41.E 42.O 42.E 43.O 43.E obs-plus44.O obs-plus44.E obs-plus45.O \
+obs-plus45.E 50.O 50.E 51.O 51.E 52.O 52.E 53.O 53.E 54.O 54.E 55.O 55.E 56.O \
+56.E 57.O 57.E 60.O 60.E 60z.O 60z.E 61.O 61.E 62.O 62.E 63.O 63.E 64.O 64.E \
+65.O 65.E 90.O 90.E 91.O 91.E 92.O 92.E 93.O 93.E 94.O 94.E 101.O 101.E 102.O \
+102.E 110.O 110.E 111.O 111.E 112.O 112.E 113.O 113.E 114.O 114.E 115.O 115.E \
+116.O 116.E 117.O 117.E 118.O 118.E 119.O 119.E 120.O 120.E 121.O 121.E 122.O \
+122.E 123.O 123.E
 ##test-files-end
 
 EXTRA_DIST = Test.pm $x-tests $(explicit) $(maint_gen)
Index: tests/uniq/Test.pm
===================================================================
RCS file: /sources/coreutils/coreutils/tests/uniq/Test.pm,v
retrieving revision 1.17
diff -u -p -r1.17 Test.pm
--- tests/uniq/Test.pm	13 Dec 2006 21:27:05 -0000	1.17
+++ tests/uniq/Test.pm	12 May 2007 14:27:54 -0000
@@ -29,25 +29,42 @@ my @tv = (
 #
 ['1',  '',    '',                '',                0],
 ['2',  '',    "a\na\n",          "a\n",             0],
+['2z', '-z',  "a\na\n",          "a\na\n\0",        0],
+['2z2','-z',  "a\0a\0",          "a\0",             0],
 ['3',  '',    "a\na",            "a\n",             0],
+['3z', '-z',  "a\na",            "a\na\0",          0],
+['3z2','-z',  "a\0a",            "a\0",             0],
 ['4',  '',    "a\nb",            "a\nb\n",          0],
+['4z', '-z',  "a\nb",            "a\nb\0",          0],
+['4z2','-z',  "a\0b",            "a\0b\0",          0],
 ['5',  '',    "a\na\nb",         "a\nb\n",          0],
+['5z', '-z',  "a\na\nb",         "a\na\nb\0",       0],
+['5z2','-z',  "a\0a\0b",         "a\0b\0",          0],
 ['6',  '',    "b\na\na\n",       "b\na\n",          0],
+['6z2','-z',  "b\0a\0a\0",       "b\0a\0",          0],
+
 ['7',  '',    "a\nb\nc\n",       "a\nb\nc\n",       0],
 # Make sure that eight bit characters work
 ['8',  '',    "ö\nv\n",          "ö\nv\n",          0],
+['8z', '-z',  "ö\0v\0",          "ö\0v\0",          0],
 # Test output of -u option; only unique lines
 ['9',  '-u',  "a\na\n",          "",                0],
+['9z', '-uz', "a\0a\0",          "",                0],
 ['10', '-u',  "a\nb\n",          "a\nb\n",          0],
+['10z','-uz', "a\0b\0",          "a\0b\0",          0],
 ['11', '-u',  "a\nb\na\n",       "a\nb\na\n",       0],
+['11z','-uz', "a\0b\0a\0",       "a\0b\0a\0",       0],
 ['12', '-u',  "a\na\n",          "",                0],
 ['13', '-u',  "a\na\n",          "",                0],
 #['5',  '-u',  "a\na\n",          "",                0],
 # Test output of -d option; only repeated lines
 ['20', '-d',  "a\na\n",          "a\n",             0],
+['20z','-dz', "a\na\n",          "",                0],
+['20z2','-dz',"a\0a\0",          "a\0",             0],
 ['21', '-d',  "a\nb\n",          "",                0],
 ['22', '-d',  "a\nb\na\n",       "",                0],
 ['23', '-d',  "a\na\nb\n",       "a\n",             0],
+['23z','-zd', "a\0a\0b\0",       "a\0",             0],
 # Check the key options
 # If we skip over fields or characters, is the output deterministic?
 ['obs30', '-1',  "a a\nb a\n",      "a a\n",           0],
@@ -56,6 +73,7 @@ my @tv = (
 ['33', '-f 1',"a a a\nb a c\n",  "a a a\nb a c\n",  0],
 ['34', '-f 1',"b a\na a\n",      "b a\n",           0],
 ['35', '-f 2',"a a c\nb a c\n",  "a a c\n",         0],
+['35z','-z -f 2',"a a c\0b a c\0",  "a a c\0",         0],
 # Skip over characters.
 ['obs-plus40', '+1',  "aaa\naaa\n",      "aaa\n",           0],
 ['obs-plus41', '+1',  "baa\naaa\n",      "baa\n",           0],
@@ -76,6 +94,7 @@ my @tv = (
 ['57', '-w 0',     "abc\nabcd\n",        "abc\n",               0],
 # Only account for a number of characters
 ['60', '-w 1',"a a\nb a\n",      "a a\nb a\n",         0],
+['60z','-z -w 1',"a a\0b a\0",      "a a\0b a\0",         0],
 ['61', '-w 3',"a a\nb a\n",      "a a\nb a\n",         0],
 ['62', '-w 1 -f 1',"a a a\nb a c\n",  "a a a\n",       0],
 ['63', '-f 1 -w 1',"a a a\nb a c\n",  "a a a\n",       0],
@@ -107,6 +126,9 @@ my @tv = (
 ['120', '-d -u', "a\na\n\b",        "",                         0],
 ['121', '-d -u -w340282366920938463463374607431768211456',
 		 "a\na\n\b",        "",                         0],
+# Check that --zero-terminated is synonymous with -z.
+['122', '--zero-terminated',  "a\na\nb",         "a\na\nb\0",       0],
+['123', '--zero-terminated',  "a\0a\0b",         "a\0b\0",          0],
 );
 
 sub test_vector

_______________________________________________
Bug-coreutils mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-coreutils
