Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Jan Schaumann
Robert Elz  wrote:

> Most of the rest of this proposal is (a disaster) - it is far too
> complicated with too many pitfalls, for very little rational benefit.

¯\_(ツ)_/¯ 

-Jan


Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Robert Elz
Date: Tue, 14 Feb 2023 12:06:03 -0500
From: Jan Schaumann

  | Setting the first name is a good alternative.

Or just the first suffix, an option for that would not be a disaster.
But it really shouldn't be needed.

Most of the rest of this proposal is (a disaster) - it is far too
complicated with too many pitfalls, for very little rational benefit.

The normal unix type way to handle this would be to put the split
files for each original file in a directory of their own (a new directory
made just for the purpose).   Then if you want to treat them all as
a single file later, it is just sd?/* (or sd* if you have that many)
and if you want to deal with the split parts of one file, you don't
need to try and work out where one ended and the next starts.

unix directories are cheap, this isn't windows or VMS ... use them!

Note that there's no reason the "name" arg to split cannot be "d1/"
(it works just as one would expect it to.)
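
A minimal sketch of that layout. The sample inputs (generated with awk)
and the directory names sd1 and sd2 are assumptions for illustration,
not something taken from the thread:

```shell
# Sketch only: sample data and the names sd1/sd2 are made up.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Two inputs of 2500 and 1500 lines.
awk 'BEGIN { for (i = 1; i <= 2500; i++) print i }' > file
awk 'BEGIN { for (i = 2501; i <= 4000; i++) print i }' > second-file

# One new directory per original file; the "name" argument is just "dir/".
mkdir sd1 sd2
split file sd1/            # default 1000 lines per piece: sd1/aa sd1/ab sd1/ac
split second-file sd2/     # sd2/aa sd2/ab

# All pieces as one stream, in order ...
cat sd?/* | wc -l
# ... or only the pieces of one original file.
ls sd2
```

Each directory holds the pieces of exactly one input, so there is
nothing to work out about where one file's pieces end.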

And while the man page doesn't say so, the "file" arg can be "-"
(as well as absent) to read stdin, so the name arg can be given in
that case as well.

  | >  $ cat file second-file | split

  | That only works if I have both files available

It also only works if you don't mind the possibility that one of the
pieces has lines (or data anyway, depending upon the split options)
from both the first and second files, as that way split cannot tell
where one ends and the other starts.

kre




Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Valery Ushakov
On Tue, Feb 14, 2023 at 17:31:36 +0100, Martin Husemann wrote:

> On Sun, Feb 12, 2023 at 04:05:20PM -0500, Jan Schaumann wrote:
> > The attached diff adds a flag "-c" (mnemonic "create,
> > don't overwrite" or "continue where you left off"):
> > 
> > $ split file; ls
> > xaa xab xac xad
> > $ split -c second-file; ls
> > xaa xab xac xad xae xaf xag xah xai xaj
> 
> I think this is a dangerous and non-obvious user interface, especially
> when we hit collisions later or data changes and we are re-doing the split.

I dislike this idea too.


> How about instead adding an option that sets the first name explicitly
> and keeps the "abort on failure" behaviour?

gnu coreutils split(1) has --numeric-suffixes[=FROM] and
--hex-suffixes[=FROM]

Maybe --text-suffixes=FROM might fit this pattern, though to me these
extensions don't seem too elegant and are stretching the original past
its breaking point.
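
For reference, a sketch of how the GNU option can continue a sequence.
This assumes GNU coreutils split and does not work with the NetBSD
split(1) under discussion; the file names and the 2-line piece size
are made up:

```shell
# Assumes GNU coreutils split; inputs and sizes are for illustration only.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
awk 'BEGIN { for (i = 1; i <= 8; i++) print "first", i }'  > file
awk 'BEGIN { for (i = 1; i <= 6; i++) print "second", i }' > second-file

split -l 2 --numeric-suffixes=0 file x          # creates x00 x01 x02 x03
split -l 2 --numeric-suffixes=4 second-file x   # creates x04 x05 x06
ls
```

The caller has to know the next free suffix (4 here); nothing is
probed or skipped automatically.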

-uwe


Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Mouse
>> Besides, isn't your intended behaviour easily done with:

>>  $ cat file second-file | split

> That only works if I have both files available at the time I run the
> split command.

It also will (unless the first file is a multiple of the split size)
take the last part of file and the first part of second-file and put
them in the same split file.  It also imposes the same split size on
both input files.

In contrast, the proposed behaviour never puts pieces from different
input files in the same output file and permits different fragment
sizes for different input files.
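
The mixing is easy to reproduce. A small sketch; the input contents
and the 2-line split size are assumptions chosen to keep it short:

```shell
# Sketch only: tiny made-up inputs, split every 2 lines.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
awk 'BEGIN { for (i = 1; i <= 5; i++) print "first", i }'  > file
awk 'BEGIN { for (i = 1; i <= 4; i++) print "second", i }' > second-file

# "-" as the file argument reads stdin, so a name argument can follow.
cat file second-file | split -l 2 - x

cat xac   # prints "first 5" then "second 1"
```

Since 5 is not a multiple of 2, xac holds the tail of file and the
head of second-file.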

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Jan Schaumann
Martin Husemann  wrote:

> How about instead adding an option that sets the first name explicitly
> and keeps the "abort on failure" behaviour?

Setting the first name is a good alternative.  I'll
have to see how that works with specifying a prefix
(e.g., user specified a first file that doesn't match
the prefix), but I'll give it a look later.

> Besides, isn't your intended behaviour easily done with:
> 
>  $ cat file second-file | split
> 
> ?

That only works if I have both files available at the
time I run the split command.  I also have cases where
I split one file and only later want to split a second
file but produce output files continuing the sequence.

-Jan


Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Edgar Fuß
> How about instead adding an option that sets the first name explicitly
> and keeps the "abort on failure" behaviour?
That looks like a much better idea to me.


Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Martin Husemann
On Sun, Feb 12, 2023 at 04:05:20PM -0500, Jan Schaumann wrote:
> The attached diff adds a flag "-c" (mnemonic "create,
> don't overwrite" or "continue where you left off"):
> 
> $ split file; ls
> xaa xab xac xad
> $ split -c second-file; ls
> xaa xab xac xad xae xaf xag xah xai xaj

I think this is a dangerous and non-obvious user interface, especially
when we hit collisions later or data changes and we are re-doing the split.

How about instead adding an option that sets the first name explicitly
and keeps the "abort on failure" behaviour?

Besides, isn't your intended behaviour easily done with:

 $ cat file second-file | split

?

Martin


Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Mouse
> $ split -n 4 -c file; ls
> xaa xab xac xad xae xaf xag
>     --- ---         --- ---

> I don't see a way around that: split(1) would need to look ahead at
> _any_ possible file to be able to determine if the current file name
> falls into a hole in the sequence.

That isn't that hard to do, assuming the containing directory is
readable to the user running split, though there is still a race if two
split instances are writing with the same prefix in the same directory.
But _that_ race is pretty much unavoidable.

> If you think it's worth calling out, we could try to do so in the
> manual page: [...]

Could be worth doing.  Perhaps split could watch for this and (possibly
optionally) warn to stderr if it happens?

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Jan Schaumann
Ignatios Souvatzis  wrote:
 
> Definitely O_EXCL and EEXIST, yes. But we still can fall into a hole
> in the sequence, fill it, and skip over the remaining part(s), thus
> interleaving our new and the preexisting files.

Ah, you mean if I currently have

$ ls
xaa xad xae

and then run

$ split -n 4 -c file; ls
xaa xab xac xad xae xaf xag
    --- ---         --- ---


I don't see a way around that: split(1) would need to
look ahead at _any_ possible file to be able to
determine if the current file name falls into a hole
in the sequence.
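
The effect can be emulated without the patch. A sketch, assuming the
shell's noclobber option (set -C) as a stand-in for O_EXCL and a
hand-written candidate list in place of split's own name generator:

```shell
# Emulation only: -c is the proposed flag, so mimic its "skip existing names" rule.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
touch xaa xad xae          # pre-existing sequence with a hole at xab, xac

set -C                     # noclobber: ">" fails if the regular file already exists
created=""
need=4                     # pretend "split -n 4" wants four output files
for name in xaa xab xac xad xae xaf xag xah; do
    [ "$need" -gt 0 ] || break
    # The subshell keeps a failed redirection from terminating the script.
    if (: > "$name") 2>/dev/null; then
        created="$created $name"
        need=$((need - 1))
    fi
done
echo "new files:$created"  # prints "new files: xab xac xaf xag"
```

The new pieces land on both sides of the old xad and xae, which is
exactly the interleaving described above.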

If you think it's worth calling out, we could try to
do so in the manual page:

"If the -c flag is specified, split will instead
continue to generate output file names until it finds
one that does not already exist.  (Note: this may fill
a "hole" in a pre-existing sequence of files such that
the final list of all output files may end up out of
sequence.)"

?

-Jan


Re: split(1): add '-c' to continue creating files

2023-02-14 Thread Ignatios Souvatzis
On Sun, Feb 12, 2023 at 04:19:43PM -0500, Jan Schaumann wrote:
> Jan Schaumann  wrote:
> 
> > The attached diff adds a flag "-c" (mnemonic "create,
> > don't overwrite" or "continue where you left off"):
> 
> Ugh, and once more without a race condition. [...]


Definitely O_EXCL and EEXIST, yes. But we still can fall into a hole
in the sequence, fill it, and skip over the remaining part(s), thus
interleaving our new and the preexisting files.

I tried to wrap my brain around how to use a flag or two to detect
this so that we can err out at the second part of the preexisting
sequence, but the damage would have happened already.

Maybe the programmer should be warned to use an exclusive directory for
his (multiple) splits.

-is


Re: split(1): add '-c' to continue creating files

2023-02-12 Thread Jan Schaumann
Jan Schaumann  wrote:

> The attached diff adds a flag "-c" (mnemonic "create,
> don't overwrite" or "continue where you left off"):

Ugh, and once more without a race condition.

-Jan
Index: split.1
===================================================================
RCS file: /cvsroot/src/usr.bin/split/split.1,v
retrieving revision 1.16
diff -u -p -r1.16 split.1
--- split.1 30 Jan 2023 15:22:02 -  1.16
+++ split.1 12 Feb 2023 21:18:10 -
@@ -29,7 +29,7 @@
 .\"
 .\"@(#)split.1 8.3 (Berkeley) 4/16/94
 .\"
-.Dd January 28, 2023
+.Dd February 12, 2023
 .Dt SPLIT 1
 .Os
 .Sh NAME
@@ -37,6 +37,7 @@
 .Nd split a file into pieces
 .Sh SYNOPSIS
 .Nm
+.Op Fl c
 .Op Fl a Ar suffix_length
 .Oo
 .Fl b Ar byte_count Ns Oo Li k|m Oc |
@@ -78,6 +79,9 @@ If
 is appended to the number, the file is split into
 .Ar byte_count
 megabyte pieces.
+.It Fl c
+Continue creating files and do not overwrite existing
+output files.
 .It Fl l
 Create smaller files
 .Ar line_count
@@ -111,6 +115,16 @@ If the
 argument is not specified,
 .Ql x
 is used.
+.Pp
+By default,
+.Nm
+will overwrite any existing output files.
+If the
+.Fl c
+flag is specified,
+.Nm
+will instead continue to generate output file names
+until it finds one that does not already exist.
 .Sh STANDARDS
 The
 .Nm
Index: split.c
===================================================================
RCS file: /cvsroot/src/usr.bin/split/split.c,v
retrieving revision 1.30
diff -u -p -r1.30 split.c
--- split.c 12 Feb 2023 20:43:21 -  1.30
+++ split.c 12 Feb 2023 21:18:11 -
@@ -56,6 +56,7 @@ __RCSID("$NetBSD: split.c,v 1.30 2023/02
 
 #define DEFLINE	1000	/* Default num lines per file. */
 
+static int clobber = 1;		/* Whether to overwrite existing output files. */
 static int file_open;		/* If a file is open. */
 static int ifd = STDIN_FILENO, ofd = -1; /* Input/output file descriptors. */
 static char *fname;		/* File name prefix. */
@@ -79,7 +80,7 @@ main(int argc, char *argv[])
 	off_t numlines = 0;	/* Line count to split on. */
 	off_t chunks = 0;	/* Number of chunks to split into. */
 
-	while ((ch = getopt(argc, argv, "0123456789b:l:a:n:")) != -1)
+	while ((ch = getopt(argc, argv, "0123456789b:cl:a:n:")) != -1)
 		switch (ch) {
 		case '0': case '1': case '2': case '3': case '4':
 		case '5': case '6': case '7': case '8': case '9':
@@ -115,6 +116,9 @@ main(int argc, char *argv[])
 			else if (*ep == 'm')
 				bytecnt *= 1024 * 1024;
 			break;
+		case 'c':		/* Continue, don't overwrite output files. */
+			clobber = 0;
+			break;
 		case 'l':		/* Line count. */
 			if (numlines != 0)
 				usage();
@@ -317,6 +321,11 @@ newfile(void)
 	static int fnum;
 	static char *fpnt;
 	int quot, i;
+	int flags = O_WRONLY | O_CREAT | O_TRUNC;
+
+	if (!clobber) {
+		flags |= O_EXCL;
+	}
 
 	if (ofd == -1) {
 		fpnt = fname + strlen(fname);
@@ -324,6 +333,7 @@ newfile(void)
 	} else if (close(ofd) != 0)
 		err(1, "%s", fname);
 
+again:
 	quot = fnum;
 
 	/* If '-a' is not specified, then we automatically expand the
@@ -364,8 +374,13 @@ newfile(void)
 	if (quot > 0)
 		errx(1, "too many files.");
 	++fnum;
-	if ((ofd = open(fname, O_WRONLY | O_CREAT | O_TRUNC, DEFFILEMODE)) < 0)
+
+	if ((ofd = open(fname, flags, DEFFILEMODE)) < 0) {
+		if (!clobber && errno == EEXIST) {
+			goto again;
+		}
 		err(1, "%s", fname);
+	}
 }
 
 static size_t


split(1): add '-c' to continue creating files

2023-02-12 Thread Jan Schaumann
Hello,

Currently, split(1) will clobber any existing output
files:

$ split file; ls
xaa xab xac xad
$ split second-file; ls
xaa xab xac xad xae xaf

I often would like for it to pick up where it left off
and continue creating files in the sequence.  Right
now, there is no good way for me to yield the desired
output of "xaa xab xac xad xae xaf xag xah xai xaj".

The attached diff adds a flag "-c" (mnemonic "create,
don't overwrite" or "continue where you left off"):

$ split file; ls
xaa xab xac xad
$ split -c second-file; ls
xaa xab xac xad xae xaf xag xah xai xaj

Any objections?

-Jan
Index: split.1
===================================================================
RCS file: /cvsroot/src/usr.bin/split/split.1,v
retrieving revision 1.16
diff -u -p -r1.16 split.1
--- split.1 30 Jan 2023 15:22:02 -  1.16
+++ split.1 12 Feb 2023 20:57:09 -
@@ -29,7 +29,7 @@
 .\"
 .\"@(#)split.1 8.3 (Berkeley) 4/16/94
 .\"
-.Dd January 28, 2023
+.Dd February 12, 2023
 .Dt SPLIT 1
 .Os
 .Sh NAME
@@ -37,6 +37,7 @@
 .Nd split a file into pieces
 .Sh SYNOPSIS
 .Nm
+.Op Fl c
 .Op Fl a Ar suffix_length
 .Oo
 .Fl b Ar byte_count Ns Oo Li k|m Oc |
@@ -78,6 +79,9 @@ If
 is appended to the number, the file is split into
 .Ar byte_count
 megabyte pieces.
+.It Fl c
+Continue creating files and do not overwrite existing
+output files.
 .It Fl l
 Create smaller files
 .Ar line_count
@@ -111,6 +115,16 @@ If the
 argument is not specified,
 .Ql x
 is used.
+.Pp
+By default,
+.Nm
+will overwrite any existing output files.
+If the
+.Fl c
+flag is specified,
+.Nm
+will instead continue to generate output file names
+until it finds one that does not already exist.
 .Sh STANDARDS
 The
 .Nm
Index: split.c
===================================================================
RCS file: /cvsroot/src/usr.bin/split/split.c,v
retrieving revision 1.30
diff -u -p -r1.30 split.c
--- split.c 12 Feb 2023 20:43:21 -  1.30
+++ split.c 12 Feb 2023 20:57:09 -
@@ -56,6 +56,7 @@ __RCSID("$NetBSD: split.c,v 1.30 2023/02
 
 #define DEFLINE	1000	/* Default num lines per file. */
 
+static int clobber = 1;		/* Whether to overwrite existing output files. */
 static int file_open;		/* If a file is open. */
 static int ifd = STDIN_FILENO, ofd = -1; /* Input/output file descriptors. */
 static char *fname;		/* File name prefix. */
@@ -79,7 +80,7 @@ main(int argc, char *argv[])
 	off_t numlines = 0;	/* Line count to split on. */
 	off_t chunks = 0;	/* Number of chunks to split into. */
 
-	while ((ch = getopt(argc, argv, "0123456789b:l:a:n:")) != -1)
+	while ((ch = getopt(argc, argv, "0123456789b:cl:a:n:")) != -1)
 		switch (ch) {
 		case '0': case '1': case '2': case '3': case '4':
 		case '5': case '6': case '7': case '8': case '9':
@@ -115,6 +116,9 @@ main(int argc, char *argv[])
 			else if (*ep == 'm')
 				bytecnt *= 1024 * 1024;
 			break;
+		case 'c':		/* Continue, don't overwrite output files. */
+			clobber = 0;
+			break;
 		case 'l':		/* Line count. */
 			if (numlines != 0)
 				usage();
@@ -324,6 +328,7 @@ newfile(void)
 	} else if (close(ofd) != 0)
 		err(1, "%s", fname);
 
+again:
 	quot = fnum;
 
 	/* If '-a' is not specified, then we automatically expand the
@@ -364,6 +369,11 @@ newfile(void)
 	if (quot > 0)
 		errx(1, "too many files.");
 	++fnum;
+
+	if (!clobber && (access(fname, F_OK) == 0)) {
+		goto again;
+	}
+
 	if ((ofd = open(fname, O_WRONLY | O_CREAT | O_TRUNC, DEFFILEMODE)) < 0)
 		err(1, "%s", fname);
 }