Re: [julia-users] Plain text files manipulation

Stefan Karpinski Mon, 22 Feb 2016 08:31:49 -0800

Not a Julia solution, but there are some standard UNIX command line tools
for this: comm <https://en.wikipedia.org/wiki/Comm> and uniq
<https://en.wikipedia.org/wiki/Uniq>. These are how I would go about doing
this kind of work, at least initially – it's what they're made for and
they're very efficient. Note that comm expects sorted input and uniq only
collapses adjacent duplicate lines, so both are normally paired with sort.


If you want to do this stuff in Julia, however, it's pretty easy too.
You'll make extensive use of Sets for this: read the lines from one file and
add each one to a Set; then read the other file and look each line up in the
Set to see if it was present. E.g. the following script prints the lines
present in both files given as arguments:

#!/usr/bin/env julia

length(ARGS) == 2 || error("two arguments expected")

const lines = Set{UTF8String}()

open(ARGS[1]) do f
    for line in eachline(f)
        push!(lines, chomp(line))
    end
end

open(ARGS[2]) do f
    for line in eachline(f)
        line = chomp(line)
        if line in lines
            println(line)
        end
    end
end
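
The same Set-based idea covers the "remove duplicate lines" part of the
question. What follows is only a minimal sketch, not a tested tool: the name
unique_lines is made up for illustration, and it assumes a current Julia
(1.x), where String has replaced UTF8String and eachline already strips the
trailing newline.

```julia
# Copy lines from `input` to `output`, skipping any line already seen.
# `unique_lines` is a hypothetical helper, not something provided by Base.
function unique_lines(input::IO, output::IO)
    seen = Set{String}()
    for line in eachline(input)      # newline is stripped by default in Julia 1.x
        if !(line in seen)
            push!(seen, line)
            println(output, line)    # first occurrence: keep it
        end
    end
    return length(seen)              # number of distinct lines written
end

# Small in-memory demonstration; real use would pass handles from open().
out = IOBuffer()
n = unique_lines(IOBuffer("a\nb\na\nc\nb\n"), out)
result = String(take!(out))
```

Unlike uniq, this does not need sorted input and keeps the first occurrence
of each line in its original position; the cost is that memory use grows with
the number of distinct lines.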


You can also do these sorts of things in a more library-like way:

function lines(io::IO)
    line_set = Set{UTF8String}()
    for line in eachline(io)
        push!(line_set, chomp(line))
    end
    return line_set
end
lines(path::AbstractString) = open(lines, path)

julia> w1 = lines("/usr/share/dict/words") # OS X system dictionary
Set(UTF8String["diseaseful","xenyl","Dezaley","ironheartedly","nimbused","ungoverned","tarantass","hatlessness","titration","photosynthesis"
 …
"ponderous","shorten","metroptosia","detractiveness","microbium","boater","navette","tridiametral","notekin","infidelic"])

julia> w2 = lines("/Users/stefan/tmp/words") # dictionary copied from a Linux system
Set(UTF8String["rearrangement","pintoes","dial's","inattentive","pewee's","photosynthesis","sleepwalking","caring's","cirrhosis's","entomb"
 …
"affidavit's","boater","deli's","gray's","Concetta","vituperates","overtaxing","graybeard","barrenest","Nevadans"])

julia> length(w1)
235886

julia> length(w2)
99171

julia> w1 ∩ w2
Set(UTF8String["confined","baleful","rearrangement","piecemeal","irreplaceable","shortbread","waster","null","staphylococcus","indelicacy"
 …
"uncut","boater","resell","joint","wavering","munitions","graybeard","treacherous","upsurge","oblique"])

julia> length(w1 ∩ w2)
35077
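
Lines present in one file but not the other follow the same pattern with
setdiff, and symdiff gives the lines present in exactly one of the two. A
small sketch, assuming a current Julia (1.x, so String rather than
UTF8String) and using made-up in-memory data in place of the dictionary
files:

```julia
# Same helper as above, written for Julia 1.x where String replaced UTF8String.
function lines(io::IO)
    line_set = Set{String}()
    for line in eachline(io)         # eachline strips the newline in Julia 1.x
        push!(line_set, line)
    end
    return line_set
end
lines(path::AbstractString) = open(lines, path)

# Stand-in data; with real files, call lines("path/to/file") instead.
a = lines(IOBuffer("apple\nboater\ncherry\n"))
b = lines(IOBuffer("boater\ncherry\ndamson\n"))

only_in_a = setdiff(a, b)    # lines in a but not in b
only_in_b = setdiff(b, a)    # lines in b but not in a
in_one    = symdiff(a, b)    # lines in exactly one of the two
in_both   = intersect(a, b)  # same as a ∩ b
```

Sets are unordered, so if the output order matters, stream the second file
and test each line against the Set, as in the script earlier.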

On a computer with a decent amount of RAM, dealing with data that's
hundreds of MB shouldn't be a problem.

On Mon, Feb 22, 2016 at 8:13 AM, barbara.g <barbara.gucc...@gmail.com>
wrote:

> Hi !
>
> I have just stepped into Julia, I didn't know about it (her...) before.
>
> I must handle plain text files, each many hundreds of MB or larger;
> the goal is to remove duplicate lines, to find lines present in both of
> two files, to find lines present in one and not in the other, etc.
>
> I used to do such operations in Mathematica because it perfectly fits
> my needs when files are small enough to be loaded entirely into RAM, but
> when they grow the jobs become practically infeasible.
>
> Can Julia provide a (more or less) out-of-the-box solution or, at least,
> an easily programmable one?
>
> Yours sincerely!
>
