On Thu, Jan 2, 2014 at 10:26 AM, Chris Wood <ch...@tincreek.com> wrote:
> On Thu, Jan 2, 2014 at 10:21 AM, S. Dale Morrey <sdalemor...@gmail.com> wrote:
>> I should have mentioned that the data is UTF-8 not strictly ASCII and we
>> need to preserve the encoding.
>
> Makes me think of sed.
>
> http://stackoverflow.com/questions/3557037/appending-a-line-to-a-file-only-if-it-doesnt-already-exist-using-sed

Assuming you don't need the resulting file to be in any particular
order, the approaches here are the best I've seen so far in terms of
space efficiency for shell-based utilities.  Might take unacceptably
long with your huge files, though. Take the first line of the small
file, grep the big file for that line, and see how long it takes.
Multiply that by the number of lines in the small file, and that's the
ballpark your final runtime will be in.

A bit of research tells me that the unix sort command uses an external
R-way merge sort, which is probably close enough to the optimal way to
do it when you use the -u flag as well. You might tweak command line
options to adjust the temp file sizes, scratch memory it's allowed to
use, number of parallel threads, etc. If you have enough temp space
for the temp files (it will make multiple temp files during operation)
then it's definitely the way to go in the solution space of
already-installed command line utilities. If space is tight, you can
pass a command line option that will keep the temp files compressed.
The compression command needs to support the -d option to uncompress.
See the info page for more details.

Here are some interesting references:
http://vkundeti.blogspot.com/2008/03/tech-algorithmic-details-of-unix-sort.html
http://en.wikipedia.org/wiki/External_sorting

If you need industrial-class sorting solutions, here's a benchmark
page comparing various state-of-the-art sorting algorithms according
to performance on several interesting benchmarks:
http://sortbenchmark.org/

For external algorithms in general, here's a C++ library that could be
useful: http://stxxl.sourceforge.net/

And for learning about the design and analysis of external memory
algorithms, here's a free PDF textbook:
http://www.ittc.ku.edu/~jsv/Papers/Vit.IO_book.pdf

As an aside, I found this article on doing set operations in shell
script interesting:
http://www.catonmat.net/blog/set-operations-in-unix-shell/

        --Levi

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/

Reply via email to