Not a Julia solution, but there are some standard UNIX command-line tools
for this: comm and uniq. These are how I would go about doing this kind of
work, at least initially: it's what they're made for and they're very
efficient.

If you want to do this stuff in Julia, however, it's pretty easy too.
You'll make extensive use of Sets for this: read the lines from one file and
add them to a Set; then read the other file and look each line up in the Set
to see if it was present. E.g. the following script prints lines present in
both files:

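#!/usr/bin/env julia

length(ARGS) == 2 || error("two arguments expected")

# String was called UTF8String before Julia 0.5.
const lines = Set{String}()

# Collect every line of the first file.
open(ARGS[1]) do f
    for line in eachline(f)
        push!(lines, chomp(line))
    end
end

# Print each line of the second file that also occurs in the first.
open(ARGS[2]) do f
    for line in eachline(f)
        line = chomp(line)
        if line in lines
            println(line)
        end
    end
end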

You can also do these sorts of things in a more library-like way:

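# Read all lines from a stream into a Set (String was UTF8String before Julia 0.5).
function lines(io::IO)
    line_set = Set{String}()
    for line in eachline(io)
        push!(line_set, chomp(line))
    end
    return line_set
end
# Convenience method: open the file at `path` and read its lines.
lines(path::AbstractString) = open(lines, path)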

julia> w1 = lines("/usr/share/dict/words") # OS X system dictionary

julia> w2 = lines("/Users/stefan/tmp/words") # dictionary copied from a Linux system

julia> length(w1)

julia> length(w2)

julia> w1 ∩ w2

julia> length(w1 ∩ w2)

On a computer with a decent amount of RAM, dealing with data that's
hundreds of MB shouldn't be a problem.
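
For the duplicate-removal case from the original question, the same Set idea works while streaming: only the distinct lines are held in memory, and the first occurrence of each line is written out in its original order. This is a minimal sketch, assuming Julia 1.x; the function name `dedupe_lines` is made up for illustration.

```julia
# Copy `input` to `output`, skipping any line that has been seen before.
# Returns the number of distinct lines written.
function dedupe_lines(input::IO, output::IO)
    seen = Set{String}()
    for line in eachline(input)
        if !(line in seen)
            push!(seen, line)
            println(output, line)
        end
    end
    return length(seen)
end

# Small in-memory demonstration; for real files, pass streams from open().
out = IOBuffer()
n = dedupe_lines(IOBuffer("a\nb\na\nc\nb\n"), out)
result = String(take!(out))
print(result)
```

If the number of distinct lines is itself too large for RAM, this approach no longer fits and sorting the file externally (e.g. with sort and uniq) is the safer route.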

On Mon, Feb 22, 2016 at 8:13 AM, barbara.g wrote:

> Hi!
> I have just stepped into Julia; I didn't know about it (her...) before.
> I must handle plain text files, each many hundreds of MB or more;
> the goal is to remove duplicate lines, to find lines present in both of
> two files, to find lines present in one and not in the other, etc.
> I used to do such operations in Mathematica because it perfectly fits
> my needs when files are small enough to be loaded entirely into RAM, but
> when they grow the jobs become practically infeasible.
> Can Julia provide a (more or less) out-of-the-box solution or, at least,
> an easily programmable one?
> Yours sincerely!
