Pádraig Brady wrote: > I got a bit of time for the review last night... > > This was your last interface change for this: > > -b, --bytes=SIZE put SIZE bytes per output file\n\ > + -b, --bytes=/N generate N output files\n\ > + -b, --bytes=K/N print Kth of N chunks of file\n\ > -C, --line-bytes=SIZE put at most SIZE bytes of lines per output file\n\ > -d, --numeric-suffixes use numeric suffixes instead of alphabetic\n\ > -l, --lines=NUMBER put NUMBER lines per output file\n\ > + -l, --lines=/N generate N eol delineated output files\n\ > + -l, --lines=K/N print Kth of N eol delineated chunks\n\ > + -n, --number=N same as --bytes=/N\n\ > + -n, --number=K/N same as --bytes=K/N\n\ > + -r, --round-robin=N generate N eol delineated output files using\n\ > + round-robin style distribution.\n\ > + -r. --round-robin=K/N print Kth of N eol delineated chunk as -rN would\n\ > + have generated.\n\ > + -t, --term=CHAR specify CHAR as eol. This will also convert\n\ > + -b to its line delineated equivalent (-C if\n\ > + splitting normally, -l if splitting by\n\ > + chunks). C escape sequences are accepted.\n\ > > Thinking more about it, I think adding 2 modes of operation to the > already slightly complicated -bCl options is too confusing. > Since this is a separate mode of operation; i.e. one would be > specifying a particular number of files for a different reason > than a particular size, it would be better as a separate option. > So I changed -n to operate as follows. This is more general if > we want to add new split methods in future, and also compatible with > the existing BSD -n without needing a redundant option.
I like it. > -n N split into N files based on size of input > -n K/N output K of N to stdout s/K/Kth/ removes an ambiguity (and two more, below) > -n l/N split into N files while maintaining lines s/while maintaining/without splitting/ (and below) > -n l/K/N output K of N to stdout while maintaining lines > -n r/N like `l' but use round robin distribution instead of size > -n r/K/N likewise but only output K of N to stdout > > Other changes I made in the attached version are: > > Removed the -t option as that's separate. > Removed erroneous 'c' from getopt() parameters. > Use K/N in code rather than M/N to match user instructions. > Added suffix len setter/checker based on N so that > we fail immediately if the wrong -a is specified, or > if -a is not specified we auto set it. > Flagged 0/N as an error, rather than treating like /N. > Changed r/K/N to buffer using stdio for much better performance (see below) > Fixed up the errno on some errors() > Normalized all "write error" messages so that all of these commands > output a single translated error message, of the form: > "split: write error: No space left on device" > split -n 1/10 $(which split) >/dev/full > stdbuf -o0 split -n 1/10 $(which split) >/dev/full > seq 10 | split -n r/1/10 >/dev/full > seq 10 | stdbuf -o0 split -n r/1/10 >/dev/full Nice. > Re the performance of the round robin implementation; > using stdio helps a LOT as can be seen with: > ------------------------------------------------------- > $ time yes | head -n10000000 | ./split-fwrite -n r/1/1 | wc -l > 10000000 > > real 0m1.568s > user 0m1.486s > sys 0m0.072s > > $ time yes | head -n10000000 | ./split-write -n r/1/1 | wc -l > 10000000 > > real 0m50.988s > user 0m7.548s > sys 0m43.250s Indeed. We don't see a 32-x speed-up (esp. non-algorithmic) very often. > I still need to look at the round robin implementation when > outputting to file rather than stdout. I may default to using > stdio, but give an option to flush each line. I'm testing > with this currently which is performing badly just doing write() > ------------------------------------------------------- > #create fifos > yes | head -n4 | ../split -n r/4 fifo > for f in x*; do rm $f && mkfifo $f; done > > #consumer > (for f in x*; do md5sum $f& done) > md5sum.out > > #producer > seq 100000 | split -n r/4 > ------------------------------------------------------- > > BTW, other modes perform well with write() > ------------------------------------------------------- > $ yes | head -n10000000 > 10m.txt > $ time ./split -n l/1/1 <10m.txt | wc -l > 10000000 > > real 0m0.201s > user 0m0.145s > sys 0m0.043s > > $ time ./split -n 1/1 <10m.txt | wc -l > 10000000 > > real 0m0.199s > user 0m0.154s > sys 0m0.041s > > $ time ./split -n 1 <10m.txt > > real 0m0.088s > user 0m0.000s > sys 0m0.081s > ------------------------------------------------------- > > Here is stuff I intend TODO before checking in: > s/pread()/dd::skip()/ or at least add pread to bootstrap.conf > fix info docs for reworked interface > try to refactor duplicated code Thanks for doing all that!
