Re: [PATCH] split: --chunks option

Chen Guo Sun, 03 Jan 2010 12:37:39 -0800

Hi all, hope everyone had happy holidays.

Here's the patch in its entirety. Let me know if anything's not
satisfactory.


I should note that I went easy on the tests because the other
split tests didn't seem all too comprehensive themselves. Please
let me know if I need to be more exhaustive.
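
For anyone reviewing the K/N byte chunking, here is a minimal standalone
sketch of the range arithmetic that bytes_chunk_extract relies on. The
helper name chunk_bounds is hypothetical and does not appear in the
patch; it only assumes that every chunk gets file_size / N bytes and
that the last chunk absorbs the remainder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: compute the half-open
   byte range [*start, *end) of chunk K (1-based) out of N chunks for
   a file of FILE_SIZE bytes.  Each chunk holds file_size / n bytes;
   the last chunk also takes whatever remainder is left over.  */
static void
chunk_bounds (size_t k, size_t n, size_t file_size,
              size_t *start, size_t *end)
{
  /* Same constraints the patch enforces: 1 <= K <= N <= file size.  */
  assert (1 <= k && k <= n && n <= file_size);

  size_t chunk = file_size / n;
  *start = (k - 1) * chunk;
  *end = (k == n) ? file_size : k * chunk;
}
```

With the 10-byte input used in tests/misc/split-bchunk and N = 3, this
gives the ranges [0,3), [3,6) and [6,10), which is why the third
expected file is one byte longer than the other two.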

From fb783060ece188fdbcd805381d02eb3b0477d25a Mon Sep 17 00:00:00 2001
From: Chen Guo <[email protected]>
Date: Sun, 3 Jan 2010 11:16:09 -0800
Subject: [PATCH] split: divide file into equal sized chunks; add -r and -t options.

Extend --bytes and --lines to divide file into N equal pieces, or
extract Kth of N said pieces. Add -n/--number alias for BSD
compatibility.

Add -r/--round-robin option to allow division and extraction of
chunks in round robin fashion, in support of nonseekable files.

Add -t/--term option to allow user to choose delineation character;
supports parsing C escape sequences such as \n or \xdd.

* doc/coreutils.texi: update documentation of split.
* src/split.c: (eol): new global variable.
(usage, longopts, main): new options -n/--number, -r, and -t.
(bytes_split): add max_files argument. This allows for trivial
implementation for byte chunking, similar to BSD.
(lines_split, line_bytes_split): delineate line by global eol char
instead of '\n'.
(lines_chunk_split): new function. Split file into eol delineated
chunks.
(bytes_chunk_extract): new function. Extract a chunk of file.
(lines_chunk_extract): new function. Extract an eol delineated chunk
of file.
(of_info): new struct. Used by new functions lines_rr and ofd_check
to keep track of file descriptors associated with output files.
(ofd_check): new function. Shuffle file descriptors in case output
files outnumber available file descriptors.
(lines_rr): new function. Split file into chunks in round-robin
fashion.
(lines_rr_extract): new function. Extract a chunk of file, as if
chunks were created in round-robin fashion.
(chunk_parse): new function. Parses /N and K/N syntax.
(eol_parse): new function. Parses -t option argument.
* tests/Makefile.am: add new tests.
* misc/split-bchunk: new test for byte delineated chunking.
* misc/split-fail: add failure scenarios for new options.
* misc/split-l: change typo ln --version to split --version.
* misc/split-lchunk: new test for line delineated chunking.
* misc/split-rchunk: new test for round-robin chunking.
* misc/split-t: new test for user defined eol char.
---
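Reviewer note: the round-robin extraction (-r K/N) can be pictured with
the in-memory sketch below. It is an illustration only, under the
assumption that the whole input sits in one buffer; rr_extract is a
hypothetical name, and the patch's lines_rr_extract does the same
bookkeeping while reading block by block from standard input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory analogue of lines_rr_extract: copy to OUT
   every line of IN (LEN bytes, lines terminated by EOL) that a
   round-robin distribution over N chunks assigns to chunk K
   (1-based).  OUT must have room for LEN bytes.  Return the number
   of bytes copied.  */
static size_t
rr_extract (const char *in, size_t len, char eol, size_t k, size_t n,
            char *out)
{
  assert (1 <= k && k <= n);

  size_t line_no = 1;           /* chunk that owns the current line */
  size_t out_len = 0;
  const char *bp = in;
  const char *eob = in + len;

  while (bp != eob)
    {
      /* A line ends just after EOL, or at end of buffer if the last
         line is unterminated.  */
      const char *nl = memchr (bp, eol, eob - bp);
      const char *line_end = nl ? nl + 1 : eob;

      if (line_no == k)
        {
          memcpy (out + out_len, bp, line_end - bp);
          out_len += line_end - bp;
        }
      /* Only a complete line advances the round-robin counter.  */
      if (nl)
        line_no = (line_no == n) ? 1 : line_no + 1;
      bp = line_end;
    }
  return out_len;
}
```

For the five-line input used in the tests and N = 3, chunk 1 receives
lines 1 and 4, chunk 2 receives lines 2 and 5, and chunk 3 receives
line 3.
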
 doc/coreutils.texi      |   57 ++++-
 src/split.c             |  595 ++++++++++++++++++++++++++++++++++++++++++++++-
 tests/Makefile.am       |    4 +
 tests/misc/split-bchunk |   46 ++++
 tests/misc/split-fail   |    8 +
 tests/misc/split-l      |    2 +-
 tests/misc/split-lchunk |   56 +++++
 tests/misc/split-rchunk |   56 +++++
 tests/misc/split-t      |   39 +++
 9 files changed, 841 insertions(+), 22 deletions(-)
 create mode 100755 tests/misc/split-bchunk
 create mode 100755 tests/misc/split-lchunk
 create mode 100755 tests/misc/split-rchunk
 create mode 100755 tests/misc/split-t
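
Reviewer note: the -t argument handling can be summarized by the
table-driven sketch below. It is a simplified stand-in, not the patch's
eol_parse: the name parse_eol is hypothetical, it uses plain strtol
rather than xstrtol, and it bounds numeric escapes by value (0..255)
only, without the patch's extra length check.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for eol_parse: translate ARG (a
   single character, or a C escape such as \n, \011 or \x1f) into
   *EOL.  Return 0 on success, -1 if ARG does not denote exactly one
   byte.  */
static int
parse_eol (const char *arg, char *eol)
{
  static const char names[] = "abfnrtv\\'\"";
  static const char values[] = "\a\b\f\n\r\t\v\\'\"";
  assert (arg && eol);

  if (arg[0] == '\0')
    return -1;
  if (arg[0] != '\\')
    {
      /* Literal character: nothing may follow it.  */
      if (arg[1] != '\0')
        return -1;
      *eol = arg[0];
      return 0;
    }

  const char *digits = arg + 1;
  int base = 8;
  if (*digits == 'x')
    {
      digits++;
      base = 16;
    }
  else if (*digits < '0' || '7' < *digits)
    {
      /* Single-letter escape: one known letter and nothing else.  */
      const char *p = *digits ? strchr (names, *digits) : NULL;
      if (!p || digits[1] != '\0')
        return -1;
      *eol = values[p - names];
      return 0;
    }

  /* Numeric escape: strtol must consume at least one digit and the
     whole rest of ARG, and the value must fit in one byte.  */
  char *end;
  long v = strtol (digits, &end, base);
  if (end == digits || *end != '\0' || v < 0 || 255 < v)
    return -1;
  *eol = (char) v;
  return 0;
}
```

The rejected inputs mirror the failure cases added to
tests/misc/split-fail: "ab", "\nb", "\8" and "\x1FF" all return -1.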

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 444dbc7..ac022f4 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -104,7 +104,7 @@
 * shuf: (coreutils)shuf invocation.             Shuffling text files.
 * sleep: (coreutils)sleep invocation.           Delay for a specified time.
 * sort: (coreutils)sort invocation.             Sort text files.
-* split: (coreutils)split invocation.           Split into fixed-size pieces.
+* split: (coreutils)split invocation.           Split into pieces.
 * stat: (coreutils)stat invocation.             Report file(system) status.
 * stdbuf: (coreutils)stdbuf invocation.         Modify stdio buffering.
 * stty: (coreutils)stty invocation.             Print/change terminal settings.
@@ -2623,7 +2623,7 @@ These commands output pieces of the input.
 @menu
 * head invocation::             Output the first part of files.
 * tail invocation::             Output the last part of files.
-* split invocation::            Split a file into fixed-size pieces.
+* split invocation::            Split a file into pieces.
 * csplit invocation::           Split a file into context-determined pieces.
 @end menu
 
@@ -2919,15 +2919,15 @@ mean either @samp{tail ./+4} or @samp{tail -n +4}.
 
 
 @node split invocation
-...@section @command{split}: Split a file into fixed-size pieces
+...@section @command{split}: Split a file into pieces.
 
 @pindex split
 @cindex splitting a file into pieces
 @cindex pieces, splitting a file into
 
-...@command{split} creates output files containing consecutive sections of
-...@var{input} (standard input if none is given or @var{input} is
-...@samp{-}).  Synopsis:
+...@command{split} creates output files containing consecutive or interleaved
+sections of @var{input}  (standard input if none is given or @var{input}
+is @samp{-}).  Synopsis:
 
 @example
 split [...@var{option}] [...@var{input} [...@var{prefix}]]
@@ -2940,10 +2940,9 @@ left over for the last section), into each output file.
 The output files' names consist of @var{prefix} (@samp{x} by default)
 followed by a group of characters (@samp{aa}, @samp{ab}, @dots{} by
 default), such that concatenating the output files in traditional
-sorted order by file name produces
-the original input file.  If the output file names are exhausted,
-...@command{split} reports an error without deleting the output files
-that it did create.
+sorted order by file name produces the original input file (except
+...@option{-r}).  If the output file names are exhausted, @command{split}
+reports an error without deleting the output files that it did create.
 
 The program accepts the following options.  Also see @ref{Common options}.
 
@@ -2959,6 +2958,13 @@ For compatibility @command{split} also supports an obsolete
 option syntax @optio...@var{lines}}.  New scripts should use @option{-l
 @var{lines}} instead.
 
+...@item -l [...@var{k}]/@var{chunks}
+...@item --line...@var{k}]/@var{chunks}
+If @var{k} is zero or omitted, divide @var{input} into @var{chunks}
+roughly equal-sized line delineated chunks.
+
+If @var{k} is present and nonzero, print the @var{k}th of these chunks.
+
 @item -b @var{size}
 @itemx --byt...@var{size}
 @opindex -b
@@ -2966,6 +2972,13 @@ option syntax @optio...@var{lines}}.  New scripts should use @option{-l
 Put @var{size} bytes of @var{input} into each output file.
 @multiplierSuffixes{size}
 
+...@item -b [...@var{k}]/@var{chunks}
+...@itemx --byte...@var{k}]/@var{chunks}
+If @var{k} is zero or omitted, divide @var{input} into @var{chunks}
+equal-sized chunks.
+
+If @var{k} is present and nonzero, print the @var{k}th of these chunks.
+
 @item -C @var{size}
 @itemx --line-byt...@var{size}
 @opindex -C
@@ -2975,6 +2988,30 @@ possible without exceeding @var{size} bytes.  Individual lines longer than
 @var{size} bytes are broken into multiple files.
 @var{size} has the same format as for the @option{--bytes} option.
 
+...@item -n [...@var{k}]/]...@var{chunks}
+...@itemx --number [...@var{k}]/]...@var{chunks}
+...@opindex -n
+...@opindex --number
+Same as @option{--byte...@var{k}]/@var{chunks}}, for BSD compatibility.
+
+...@item -r [...@var{k}]/]...@var{chunks}
+...@itemx --round-robin [...@var{k}]/]...@var{chunks}
+...@opindex -r
+...@opindex --round-robin
+If @var{k} is zero or omitted, distribute @var{input} lines round-robin
+style into @var{chunks} output files.
+
+If @var{k} is present and nonzero, print the @var{k}th of these chunks.
+
+...@item -t @var{char}
+...@itemx --term @var{char}
+...@opindex -t
+...@opindex --term
+Set @var{char} as the end of line character.  Supports C escape sequences.
+Using this option with @option{-b @var{size}} is equivalent to
+...@option{-C @var{size}}, and with @option{-b [...@var{k}]/@var{chunks}} is
+equivalent to @option{-l [...@var{k}]/@var{chunks}}.
+
 @item -a @var{length}
 @itemx --suffix-leng...@var{length}
 @opindex -a
diff --git a/src/split.c b/src/split.c
index 5bd9ebb..b1272c4 100644
--- a/src/split.c
+++ b/src/split.c
@@ -17,8 +17,7 @@
 /* By [email protected], with rms.
 
    To do:
-   * Implement -t CHAR or -t REGEX to specify break characters other
-     than newline. */
+   * Extend -t CHAR to -t REGEX */
 
 #include <config.h>
 
@@ -72,6 +71,9 @@ static int output_desc;
    output file is opened. */
 static bool verbose;
 
+/* End of line character */
+static char eol;
+
 /* For long options that have no equivalent short option, use a
    non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
 enum
@@ -84,8 +86,11 @@ static struct option const longopts[] =
   {"bytes", required_argument, NULL, 'b'},
   {"lines", required_argument, NULL, 'l'},
   {"line-bytes", required_argument, NULL, 'C'},
+  {"number", required_argument, NULL, 'n'},
+  {"round-robin", required_argument, NULL, 'r'},
   {"suffix-length", required_argument, NULL, 'a'},
   {"numeric-suffixes", no_argument, NULL, 'd'},
+  {"term", required_argument, NULL, 't'},
   {"verbose", no_argument, NULL, VERBOSE_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -116,9 +121,23 @@ Mandatory arguments to long options are mandatory for short options too.\n\
       fprintf (stdout, _("\
   -a, --suffix-length=N   use suffixes of length N (default %d)\n\
   -b, --bytes=SIZE        put SIZE bytes per output file\n\
+  -b, --bytes=/N          generate N output files\n\
+  -b, --bytes=K/N         print Kth of N chunks of file\n\
   -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file\n\
   -d, --numeric-suffixes  use numeric suffixes instead of alphabetic\n\
   -l, --lines=NUMBER      put NUMBER lines per output file\n\
+  -l, --lines=/N          generate N eol delineated output files\n\
+  -l, --lines=K/N         print Kth of N eol delineated chunks\n\
+  -n, --number=N          same as --bytes=/N\n\
+  -n, --number=K/N        same as --bytes=K/N\n\
+  -r, --round-robin=N     generate N eol delineated output files using\n\
+                            round-robin style distribution.\n\
+  -r, --round-robin=K/N   print Kth of N eol delineated chunks as -rN would\n\
+                            have generated.\n\
+  -t, --term=CHAR         specify CHAR as eol. This will also convert\n\
+                            -b to its line delineated equivalent (-C if\n\
+                            splitting normally, -l if splitting by\n\
+                            chunks). C escape sequences are accepted.\n\
 "), DEFAULT_SUFFIX_LENGTH);
       fputs (_("\
       --verbose           print a diagnostic just before each\n\
@@ -218,13 +237,14 @@ cwrite (bool new_file_flag, const char *bp, size_t bytes)
    Use buffer BUF, whose size is BUFSIZE.  */
 
 static void
-bytes_split (uintmax_t n_bytes, char *buf, size_t bufsize)
+bytes_split (uintmax_t n_bytes, char *buf, size_t bufsize, uintmax_t max_files)
 {
   size_t n_read;
   bool new_file_flag = true;
   size_t to_read;
   uintmax_t to_write = n_bytes;
   char *bp_out;
+  uintmax_t opened = 1;
 
   do
     {
@@ -251,7 +271,8 @@ bytes_split (uintmax_t n_bytes, char *buf, size_t bufsize)
               cwrite (new_file_flag, bp_out, w);
               bp_out += w;
               to_read -= w;
-              new_file_flag = true;
+              new_file_flag = (opened++ < max_files || !max_files)?
+                              true : false;
               to_write = n_bytes;
             }
         }
@@ -277,10 +298,10 @@ lines_split (uintmax_t n_lines, char *buf, size_t bufsize)
         error (EXIT_FAILURE, errno, "%s", infile);
       bp = bp_out = buf;
       eob = bp + n_read;
-      *eob = '\n';
+      *eob = eol;
       for (;;)
         {
-          bp = memchr (bp, '\n', eob - bp + 1);
+          bp = memchr (bp, eol, eob - bp + 1);
           if (bp == eob)
             {
               if (eob != bp_out) /* do not write 0 bytes! */
@@ -340,7 +361,7 @@ line_bytes_split (size_t n_bytes)
       bp = buf + n_buffered;
       if (n_buffered == n_bytes)
         {
-          while (bp > buf && bp[-1] != '\n')
+          while (bp > buf && bp[-1] != eol)
             bp--;
         }
 
@@ -362,6 +383,328 @@ line_bytes_split (size_t n_bytes)
   free (buf);
 }
 
+/* Split into NUMBER eol chunks. */
+
+static void
+lines_chunk_split (size_t number, char *buf, size_t bufsize, size_t file_size)
+{
+  size_t n_read;
+  size_t chunk_no = 1;
+  off_t chunk_end = file_size / number - 1;
+  off_t offset = 0;
+  bool new_file_flag = true;
+  char *bp, *bp_out, *eob;
+
+  while (offset < file_size)
+    {
+      n_read = full_read (STDIN_FILENO, buf, bufsize);
+      if (n_read == SAFE_READ_ERROR)
+        error (EXIT_FAILURE, errno, "%s", infile);
+      bp = buf;
+      eob = buf + n_read;
+
+      while (1)
+        {
+          /* Begin looking for eol at last byte of chunk. */
+          bp_out = (offset < chunk_end)? bp + chunk_end - offset : bp;
+          if (bp_out > eob)
+            bp_out = eob;
+          bp_out = memchr (bp_out, eol, eob - bp_out);
+          if (!bp_out)
+            {
+              /* Buffer exhausted. */
+              cwrite (new_file_flag, bp, eob - bp);
+              new_file_flag = false;
+              offset += eob - bp;
+              break;
+           }
+          else
+            bp_out++;
+
+          cwrite (new_file_flag, bp, bp_out - bp);
+          chunk_end = (++chunk_no < number)?
+                       chunk_end + file_size / number : file_size;
+          new_file_flag = true;
+          offset += bp_out - bp;
+          bp = bp_out;
+          /* A line could have been so long that it skipped
+             entire chunks. */
+          while (chunk_end < offset)
+            {
+              chunk_end += file_size / number;
+              chunk_no++;
+              /* Create blank file: this ensures NUMBER files are
+                 created. */
+              cwrite (true, bp, 0);
+            }
+        }
+    }
+}
+
+/* Extract Nth of TOTAL chunks. */
+
+static void
+bytes_chunk_extract (size_t n, size_t total, char *buf, size_t bufsize,
+                     size_t file_size)
+{
+  off_t start = (n == 0)? 0 : (n - 1) * (file_size / total);
+  off_t end = (n == total)? file_size : n * (file_size / total);
+  ssize_t n_read;
+  size_t n_write;
+
+  while (1)
+    {
+      n_read = pread (STDIN_FILENO, buf, bufsize, start);
+      if (n_read < 0)
+        error (EXIT_FAILURE, errno, "%s", infile);
+      n_write = (start + n_read <= end)? n_read : end - start;
+      if (full_write (STDOUT_FILENO, buf, n_write) != n_write)
+        error (EXIT_FAILURE, errno, "output error");
+      start += n_read;
+      if (end <= start)
+        return;
+    }
+}
+
+/* Extract lines whose first byte is in the Nth of TOTAL chunks. */
+
+static void
+lines_chunk_extract (size_t n, size_t total, char* buf, size_t bufsize,
+                     size_t file_size)
+{
+  ssize_t n_read;
+  bool end_of_chunk = false;
+  bool skip = true;
+  char *bp = buf, *bp_out = buf, *eob;
+  off_t start;
+  off_t end;
+
+  /* For n != 1, start reading 1 byte before nth chunk of file. This is to
+     detect if the first byte of chunk is the first byte of a line. */
+  if (n == 1)
+    {
+      start = 0;
+      skip = false;
+    }
+  else
+    start = (n - 1) * (file_size / total) - 1;
+  end = (n == total)? file_size - 1 : n * (file_size / total) - 1;
+
+  do
+    {
+      n_read = pread (STDIN_FILENO, buf, bufsize, start);
+      if (n_read < 0)
+        error (EXIT_FAILURE, errno, "%s", infile);
+      bp = buf;
+      bp_out = buf + n_read;
+      eob = bp_out;
+
+      /* Find starting point. */
+      if (skip)
+        {
+          bp = memchr (buf, eol, n_read);
+          if (bp && bp - buf < end - start)
+            {
+              bp++;
+              skip = false;
+            }
+          else if (!bp && start + n_read < end)
+            {
+              start += n_read;
+              continue;
+            }
+          else
+            return;
+        }
+
+      /* Find ending point. */
+      if (end < start + n_read && end == file_size - 1)
+         end_of_chunk = true;
+      else if (start + n_read >= end)
+        {
+          bp_out = (buf + end - start < buf)? buf : buf + end - start;
+          bp_out = memchr (bp_out, eol, eob - bp_out);
+          if (bp_out)
+            {
+              bp_out++;
+              end_of_chunk = true;
+            }
+          else
+            bp_out = eob;
+        }
+
+      if (write (STDOUT_FILENO, bp, bp_out - bp) != bp_out - bp)
+        error (EXIT_FAILURE, errno, "output error");
+      start += n_read;
+    }
+  while (!end_of_chunk);
+}
+
+
+
+typedef struct of_info
+{
+  char *of_name;
+  int ofd;
+} of_t;
+
+/* Rotates file descriptors when we're writing to more output files than we
+   have available file descriptors. */
+
+static void
+ofd_check (of_t *ofiles, size_t i, size_t n)
+{
+  if (0 < ofiles[i].ofd)
+    return;
+  else
+    {
+      int fd;
+      int j = i - 1;
+
+      /* Another process could have opened a file in between the calls to
+         close and open, so we should keep trying until open succeeds or
+         we've closed all of our files. */
+      while (1)
+        {
+          /* Attempt to open file. */
+          fd = open (ofiles[i].of_name,
+                     O_WRONLY | O_CREAT | O_TRUNC | O_BINARY,
+                     (S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP
+                      | S_IROTH | S_IWOTH));
+          if (-1 < fd)
+            break;
+          /* Find an open file to close. */
+          while (ofiles[j].ofd < 0)
+            {
+              if (--j == 0)
+                j = n - 1;
+              /* No more open files to close, exit with failure. */
+              if (j == i)
+                error (EXIT_FAILURE, 0, "%s", ofiles[i].of_name);
+            }
+          close (ofiles[j].ofd);
+        }
+      ofiles[i].ofd = fd;
+    }
+}
+
+/* Divide file into N chunks in round robin fashion. */
+
+static void
+lines_rr (size_t n, char *buf, size_t bufsize)
+{
+  of_t *ofiles = xnmalloc (n, sizeof *ofiles);
+  char *bp, *bp_out, *eob;
+  size_t n_read;
+  bool eof = false;
+  size_t i;
+  bool inc = false;
+
+  /* Generate output file names. */
+  for (i = 0; i < n; i++)
+    {
+      next_file_name ();
+      ofiles[i].of_name = xmalloc (strlen (outfile) + 1);
+      strcpy (ofiles[i].of_name, outfile);
+      ofiles[i].ofd = -1;
+    }
+  i = 0;
+
+  do
+    {
+      n_read = full_read (STDIN_FILENO, buf, bufsize);
+      if (n_read == SAFE_READ_ERROR)
+        error (EXIT_FAILURE, errno, "%s", infile);
+      if (n_read < bufsize)
+        {
+          if (n_read == 0)
+            break;
+          eof = true;
+        }
+      bp = buf;
+      eob = buf + n_read;
+
+
+      while (bp != eob)
+        {
+          /* Find end of line. */
+          bp_out = memchr (bp, eol, eob - bp);
+          if (bp_out)
+            {
+              bp_out++;
+              inc = true;
+            }
+          else
+            bp_out = eob;
+
+          /* Secure file descriptor. */
+          ofd_check (ofiles, i, n);
+
+          if (full_write (ofiles[i].ofd, bp, bp_out - bp) != bp_out - bp)
+            error (EXIT_FAILURE, errno, "%s", ofiles[i].of_name);
+          if (inc && ++i == n)
+            i = 0;
+          bp = bp_out;
+          inc = false;
+        }
+    }
+  while (!eof);
+
+  /* Close any open file descriptors. */
+  for (i = 0; i < n; i++)
+    if (-1 < ofiles[i].ofd)
+      close (ofiles[i].ofd);
+}
+
+/* Extract Nth of TOT eol delineated, round robin distributed chunks. */
+
+static void
+lines_rr_extract (uintmax_t n, uintmax_t tot, char *buf, size_t bufsize)
+{
+  int line_no = 1;
+  char *bp, *bp_out, *eob;
+  size_t n_read;
+  bool eof = false;
+  bool inc = false;
+
+  do
+    {
+      n_read = full_read (STDIN_FILENO, buf, bufsize);
+      if (n_read == SAFE_READ_ERROR)
+        error (EXIT_FAILURE, errno, "%s", infile);
+      if (n_read != bufsize)
+        {
+          if (n_read == 0)
+            break;
+          eof = true;
+        }
+      bp = buf;
+      eob = buf + n_read;
+
+      while (bp != eob)
+        {
+          /* Find end of line. */
+          bp_out = memchr (bp, eol, eob - bp);
+          if (bp_out)
+            {
+              bp_out++;
+              inc = true;
+            }
+          else
+            bp_out = eob;
+
+          if (line_no == n
+              && full_write (STDOUT_FILENO, bp, bp_out - bp) != bp_out - bp)
+            error (EXIT_FAILURE, errno, "output error");
+          if (inc)
+            line_no = (line_no == tot)? 1 : line_no + 1;
+          bp = bp_out;
+          inc = false;
+        }
+    }
+  while (!eof);
+}
+
 #define FAIL_ONLY_ONE_WAY()                    \
   do                                \
     {                                \
@@ -370,21 +713,159 @@ line_bytes_split (size_t n_bytes)
     }                                \
   while (0)
 
+/* Parse K/N syntax of chunk options. */
+
+static void
+chunk_parse (uintmax_t *m_units, uintmax_t *n_units, char *slash)
+{
+  *slash = '\0';
+  if (slash != optarg
+      && xstrtoumax (optarg, NULL, 10, m_units, "") != LONGINT_OK
+      || SIZE_MAX < *m_units)
+    {
+      error (0, 0, _("%s: invalid chunk number"), optarg);
+      usage (EXIT_FAILURE);
+    }
+  if (xstrtoumax (++slash, NULL, 10, n_units, "") != LONGINT_OK
+      || *n_units == 0 || *n_units < *m_units || SIZE_MAX < *n_units)
+    {
+      error (0, 0, _("%s: invalid number of total chunks"), slash);
+      usage (EXIT_FAILURE);
+    }
+}
+
+/* Parse eol character for -t option. */
+
+static void
+eol_parse ()
+{
+  if (*optarg == '\\')
+    switch (*(optarg+1))
+      {
+      case 'a':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\a';
+        break;
+
+      case 'b':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\b';
+        break;
+
+      case 'f':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\f';
+        break;
+
+      case 'n':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\n';
+        break;
+
+      case 'r':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\r';
+        break;
+
+      case 't':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\t';
+        break;
+
+      case 'v':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\v';
+        break;
+
+      case '\'':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\'';
+        break;
+
+      case '\"':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\"';
+        break;
+
+      case '\\':
+        if (*(optarg + 2) != 0)
+          error (EXIT_FAILURE, 0, _("%s: invalid escape sequence"), optarg);
+        eol = '\\';
+        break;
+
+      case '0':
+      case '1':
+      case '2':
+      case '3':
+      case '4':
+      case '5':
+      case '6':
+      case '7':
+        {
+          char *term;
+          long int tmp;
+          if (xstrtol (optarg + 1, &term, 8, &tmp, "") != LONGINT_OK
+              || tmp < 0 || 255 < tmp || 4 + optarg < term || *term != 0)
+            error (EXIT_FAILURE, 0, _("%s: invalid octal escape sequence"),
+                   optarg);
+          eol = (char) tmp;
+          break;
+        }
+
+      case 'x':
+        {
+          char *term;
+          long int tmp;
+          if (xstrtol (optarg + 2, &term, 16, &tmp, "") != LONGINT_OK
+              || tmp < 0 || 255 < tmp || 4 + optarg < term || *term != 0)
+            error (EXIT_FAILURE, 0, _("%s: invalid hex escape sequence"),
+                   optarg);
+          eol = (char) tmp;
+          break;
+        }
+
+      default:
+        error (0, 0, _("%s: invalid escape sequence"), optarg);
+        usage (EXIT_FAILURE);
+      }
+  else
+    {
+      if (*(optarg + 1) != 0)
+        error (EXIT_FAILURE, 0, _("%s: invalid eol character"), optarg);
+      eol = *optarg;
+    }
+}
+
+
 int
 main (int argc, char **argv)
 {
   struct stat stat_buf;
   enum
     {
-      type_undef, type_bytes, type_byteslines, type_lines, type_digits
+      type_undef, type_bytes, type_byteslines, type_lines, type_digits,
+      type_chunk_bytes, type_chunk_eol, type_rr
     } split_type = type_undef;
   size_t in_blk_size;        /* optimal block size of input file device */
   char *buf;            /* file i/o buffer */
   size_t page_size = getpagesize ();
+  uintmax_t m_units = 0;
   uintmax_t n_units;
   static char const multipliers[] = "bEGKkMmPTYZ0";
   int c;
   int digits_optind = 0;
+  size_t file_size;
+  char *slash;
+  bool eol_char = false;
 
   initialize_main (&argc, &argv);
   set_program_name (argv[0]);
@@ -404,7 +885,7 @@ main (int argc, char **argv)
       /* This is the argv-index of the option we will read next.  */
       int this_optind = optind ? optind : 1;
 
-      c = getopt_long (argc, argv, "0123456789C:a:b:dl:", longopts, NULL);
+      c = getopt_long (argc, argv, "0123456789C:a:b:c:dl:n:r:t:", longopts, NULL);
       if (c == -1)
         break;
 
@@ -426,6 +907,13 @@ main (int argc, char **argv)
         case 'b':
           if (split_type != type_undef)
             FAIL_ONLY_ONE_WAY ();
+          slash = strchr (optarg, '/');
+          if (slash)
+            {
+              split_type = type_chunk_bytes;
+              chunk_parse (&m_units, &n_units, slash);
+              break;
+            }
           split_type = type_bytes;
           if (xstrtoumax (optarg, NULL, 10, &n_units, multipliers) != LONGINT_OK
               || n_units == 0)
@@ -438,6 +926,13 @@ main (int argc, char **argv)
         case 'l':
           if (split_type != type_undef)
             FAIL_ONLY_ONE_WAY ();
+          slash = strchr (optarg, '/');
+          if (slash)
+            {
+              split_type = type_chunk_eol;
+              chunk_parse (&m_units, &n_units, slash);
+              break;
+            }
           split_type = type_lines;
           if (xstrtoumax (optarg, NULL, 10, &n_units, "") != LONGINT_OK
               || n_units == 0)
@@ -459,6 +954,42 @@ main (int argc, char **argv)
             }
           break;
 
+        case 'n':
+          if (split_type != type_undef)
+            FAIL_ONLY_ONE_WAY ();
+          split_type = type_chunk_bytes;
+          slash = strchr (optarg, '/');
+          if (slash)
+            {
+              chunk_parse (&m_units, &n_units, slash);
+              break;
+            }
+          if (xstrtoumax (optarg, NULL, 10, &n_units, "") != LONGINT_OK
+              || n_units == 0 || SIZE_MAX < n_units)
+            {
+              error (0, 0, _("%s: invalid number of chunks"), optarg);
+              usage (EXIT_FAILURE);
+            }
+          break;
+
+        case 'r':
+          if (split_type != type_undef)
+            FAIL_ONLY_ONE_WAY ();
+          split_type = type_rr;
+          slash = strchr (optarg, '/');
+          if (slash)
+            {
+              chunk_parse (&m_units, &n_units, slash);
+              break;
+            }
+          if (xstrtoumax (optarg, NULL, 10, &n_units, "") != LONGINT_OK
+              || n_units == 0 || SIZE_MAX < n_units)
+            {
+              error (0, 0, _("%s: invalid number of chunks"), optarg);
+              usage (EXIT_FAILURE);
+            }
+          break;
+
         case '0':
         case '1':
         case '2':
@@ -492,6 +1023,11 @@ main (int argc, char **argv)
           suffix_alphabet = "0123456789";
           break;
 
+        case 't':
+          eol_parse ();
+          eol_char = true;
+          break;
+
         case VERBOSE_OPTION:
           verbose = true;
           break;
@@ -505,6 +1041,17 @@ main (int argc, char **argv)
         }
     }
 
+  /* Default eol to \n if none specified. */
+  if (!eol_char)
+    eol = '\n';
+  else
+    {
+      if (split_type == type_chunk_bytes)
+        split_type = type_chunk_eol;
+      if (split_type == type_bytes)
+        split_type = type_byteslines;
+    }
+
   /* Handle default case.  */
   if (split_type == type_undef)
     {
@@ -546,10 +1093,15 @@ main (int argc, char **argv)
   output_desc = -1;
 
   /* Get the optimal block size of input device and make a buffer.  */
-
   if (fstat (STDIN_FILENO, &stat_buf) != 0)
     error (EXIT_FAILURE, errno, "%s", infile);
   in_blk_size = io_blksize (stat_buf);
+  file_size = stat_buf.st_size;
+
+  if (split_type == type_chunk_bytes || split_type == type_chunk_eol
+      || split_type == type_rr)
+    if (file_size < n_units)
+      error (EXIT_FAILURE, 0, _("number of chunks exceeds file size"));
 
   buf = ptr_align (xmalloc (in_blk_size + 1 + page_size - 1), page_size);
 
@@ -561,13 +1113,34 @@ main (int argc, char **argv)
       break;
 
     case type_bytes:
-      bytes_split (n_units, buf, in_blk_size);
+      bytes_split (n_units, buf, in_blk_size, 0);
       break;
 
     case type_byteslines:
       line_bytes_split (n_units);
       break;
 
+    case type_chunk_bytes:
+      if (m_units == 0)
+        bytes_split (file_size / n_units, buf, in_blk_size, n_units);
+      else
+        bytes_chunk_extract (m_units, n_units, buf, in_blk_size, file_size);
+      break;
+
+    case type_chunk_eol:
+      if (m_units == 0)
+        lines_chunk_split (n_units, buf, in_blk_size, file_size);
+      else
+        lines_chunk_extract (m_units, n_units, buf, in_blk_size, file_size);
+      break;
+
+    case type_rr:
+      if (m_units == 0)
+        lines_rr (n_units, buf, in_blk_size);
+      else
+        lines_rr_extract (m_units, n_units, buf, in_blk_size);
+      break;
+
     default:
       abort ();
     }
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 85503cc..89d2e40 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -228,8 +228,12 @@ TESTS =                        \
   misc/sort-rand                \
   misc/sort-version                \
   misc/split-a                    \
+  misc/split-bchunk                \
   misc/split-fail                \
   misc/split-l                    \
+  misc/split-lchunk                \
+  misc/split-rchunk                \
+  misc/split-t                    \
   misc/stat-fmt                    \
   misc/stat-hyphen                \
   misc/stat-printf                \
diff --git a/tests/misc/split-bchunk b/tests/misc/split-bchunk
new file mode 100755
index 0000000..15c0d64
--- /dev/null
+++ b/tests/misc/split-bchunk
@@ -0,0 +1,46 @@
+#!/bin/sh
+# show that splitting into 3 byte delineated chunks works.
+
+# Copyright (C) 2009 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+if test "$VERBOSE" = yes; then
+  set -x
+  split --version
+fi
+. $srcdir/test-lib.sh
+
+printf '1\n2\n3\n4\n5\n' > in || framework_failure
+
+split --bytes=/3 in > out || fail=1
+split --bytes=1/3 in > b1 || fail=1
+split --bytes=2/3 in > b2 || fail=1
+split --bytes=3/3 in > b3 || fail=1
+printf '1\n2' > exp-1
+printf '\n3\n' > exp-2
+printf '4\n5\n' > exp-3
+
+compare xaa exp-1 || fail=1
+compare xab exp-2 || fail=1
+compare xac exp-3 || fail=1
+compare b1 exp-1 || fail=1
+compare b2 exp-2 || fail=1
+compare b3 exp-3 || fail=1
+test -f xad && fail=1
+
+# Splitting into more chunks than file size should fail.
+split --bytes=/20 in 2> /dev/null && fail=1
+
+Exit $fail
diff --git a/tests/misc/split-fail b/tests/misc/split-fail
index e36c86d..4a0c9c3 100755
--- a/tests/misc/split-fail
+++ b/tests/misc/split-fail
@@ -29,8 +29,11 @@ touch in || framework_failure
 
 split -a 0 in 2> /dev/null || fail=1
 split -b 0 in 2> /dev/null && fail=1
+split -b /0 in 2> /dev/null && fail=1
 split -C 0 in 2> /dev/null && fail=1
 split -l 0 in 2> /dev/null && fail=1
+split -l /0 in 2> /dev/null && fail=1
+split -t in 2> /dev/null && fail=1
 
 # Make sure -C doesn't create empty files.
 rm -f x?? || fail=1
@@ -64,5 +67,10 @@ split: line count option -99*... is too large
 EOF
 compare out exp || fail=1
 
+# Make sure invalid -t characters are not accepted.
+split -tab in 2> /dev/null && fail=1
+split -t\\nb in 2> /dev/null && fail=1
+split -t\\8 in 2> /dev/null && fail=1
+split -t\\x1FF in 2> /dev/null && fail=1
 
 Exit $fail
diff --git a/tests/misc/split-l b/tests/misc/split-l
index fb07a27..850d5b5 100755
--- a/tests/misc/split-l
+++ b/tests/misc/split-l
@@ -18,7 +18,7 @@
 
 if test "$VERBOSE" = yes; then
   set -x
-  ln --version
+  split --version
 fi
 
 . $srcdir/test-lib.sh
diff --git a/tests/misc/split-lchunk b/tests/misc/split-lchunk
new file mode 100755
index 0000000..cb71939
--- /dev/null
+++ b/tests/misc/split-lchunk
@@ -0,0 +1,56 @@
+#!/bin/sh
+# show that splitting into 3 newline delineated chunks works.
+
+# Copyright (C) 2009 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+if test "$VERBOSE" = yes; then
+  set -x
+  split --version
+fi
+
+. $srcdir/test-lib.sh
+
+printf '1\n2\n3\n4\n5\n' > in || framework_failure
+
+split --lines=/3 in > out || fail=1
+split --lines=1/3 in > l1 || fail=1
+split --lines=2/3 in > l2 || fail=1
+split --lines=3/3 in > l3 || fail=1
+
+cat <<\EOF > exp-1
+1
+2
+EOF
+cat <<\EOF > exp-2
+3
+EOF
+cat <<\EOF > exp-3
+4
+5
+EOF
+
+compare xaa exp-1 || fail=1
+compare xab exp-2 || fail=1
+compare xac exp-3 || fail=1
+compare l1 exp-1 || fail=1
+compare l2 exp-2 || fail=1
+compare l3 exp-3 || fail=1
+test -f xad && fail=1
+
+# Splitting into more chunks than file size should fail.
+split --bytes=/20 in 2> /dev/null && fail=1
+
+Exit $fail
diff --git a/tests/misc/split-rchunk b/tests/misc/split-rchunk
new file mode 100755
index 0000000..080e6a2
--- /dev/null
+++ b/tests/misc/split-rchunk
@@ -0,0 +1,56 @@
+#!/bin/sh
+# show that splitting into 3 round-robin chunks works.
+
+# Copyright (C) 2009 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+if test "$VERBOSE" = yes; then
+  set -x
+  split --version
+fi
+
+. $srcdir/test-lib.sh
+
+printf '1\n2\n3\n4\n5\n' > in || framework_failure
+
+split --round-robin=/3 in > out || fail=1
+split --round-robin=1/3 in > r1 || fail=1
+split --round-robin=2/3 in > r2 || fail=1
+split --round-robin=3/3 in > r3 || fail=1
+
+cat <<\EOF > exp-1
+1
+4
+EOF
+cat <<\EOF > exp-2
+2
+5
+EOF
+cat <<\EOF > exp-3
+3
+EOF
+
+compare xaa exp-1 || fail=1
+compare xab exp-2 || fail=1
+compare xac exp-3 || fail=1
+compare r1 exp-1 || fail=1
+compare r2 exp-2 || fail=1
+compare r3 exp-3 || fail=1
+test -f xad && fail=1
+
+# Splitting into more chunks than file size should fail.
+split --bytes=/20 in 2> /dev/null && fail=1
+
+Exit $fail
diff --git a/tests/misc/split-t b/tests/misc/split-t
new file mode 100755
index 0000000..4fba0f2
--- /dev/null
+++ b/tests/misc/split-t
@@ -0,0 +1,39 @@
+#!/bin/sh
+# show that splitting with '\0' as the eol char works.
+
+# Copyright (C) 2009 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+if test "$VERBOSE" = yes; then
+  set -x
+  split --version
+fi
+
+. $srcdir/test-lib.sh
+
+printf 'a\0b\0c\0d\0e\0' > in || framework_failure
+
+split -l 2 -t \\0  in > out || fail=1
+
+printf 'a\0b\0' > exp-1
+printf 'c\0d\0' > exp-2
+printf 'e\0' > exp-3
+
+compare xaa exp-1 || fail=1
+compare xab exp-2 || fail=1
+compare xac exp-3 || fail=1
+test -f xad && fail=1
+
+Exit $fail
-- 
1.6.3.3
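
For anyone checking the expected files in the new tests by hand, here is a
rough sketch of the two layouts using only standard tools. It is not the
code in src/split.c. The helper names byte_chunk and rr_chunk are made up
for illustration, and the rule "each chunk is size/N bytes, the last chunk
takes the remainder" is an assumption inferred from exp-1..exp-3 in
split-bchunk (3, 3 and 4 bytes for the 10-byte input).

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not part of the patch.

# byte_chunk K N FILE: print the Kth of N byte chunks of FILE.
# Assumes chunk size is size/N and the last chunk absorbs the remainder.
byte_chunk () {
  k=$1 n=$2 file=$3
  size=$(wc -c < "$file")
  size=$((size + 0))            # normalize any padding in wc output
  chunk=$((size / n))
  start=$(( (k - 1) * chunk ))
  if test "$k" -eq "$n"; then
    count=$((size - start))     # last chunk: everything that is left
  else
    count=$chunk
  fi
  dd if="$file" bs=1 skip="$start" count="$count" 2> /dev/null
}

# rr_chunk K N FILE: print the lines that round-robin distribution
# would send to the Kth of N outputs (lines K, K+N, K+2N, ...).
rr_chunk () {
  awk -v k="$1" -v n="$2" 'NR % n == k % n' "$3"
}
```

With the five-line input used in the tests, byte_chunk 2 3 in prints a
newline followed by "3" and another newline, and rr_chunk 1 3 in prints
lines 1 and 4, which is what the expected files in split-bchunk and
split-rchunk contain.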
