Hi,

My goal is to have 10 parallel processes read the same file, with each 
process consuming 1/10th of it. Every process reads all lines of the 
file but skips over the lines not belonging to it. So process #1 would 
process lines 1, 11, 21, etc.; the second would process lines 2, 12, 22, etc. 
The issue I am seeing has nothing to do with the efficiency or performance 
of this approach, so let us set that aside for now.
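To make the intended split concrete: the selection rule is just a round-robin over line numbers. A tiny sketch (in Python purely for illustration; the function name is mine, and the actual Julia code is quoted at the end):

```python
# Round-robin split: reader k (1-based) of n scans every line but keeps
# only lines k, k+n, k+2n, ...  Across all n readers, every line is
# covered exactly once.
def lines_for_reader(lines, k, n):
    for lnum, line in enumerate(lines, start=1):
        if (lnum - 1) % n == k - 1:
            yield lnum, line
```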

The code is checked into a repository here 
https://github.com/harikb/scratchpad1 
(including some sample data), but it is also quoted at the end of this email.

$ julia --version
julia version 0.3.7

*# Input: some sample data from the NYC public database (see the repo link above, but any file should be enough)*
$ wc -l nyc311calls.csv 
250000 nyc311calls.csv
*# Ignore the fact that I am not using a CSV reader; this is just test data, and there is no multi-line quoted CSV content here.*

$ julia -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l 
250001
*# Non-parallel run: everything is fine. The one extra line is the initial print statement from _driver.jl*

*# Now, let us run with 10 parallel processes*
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
26420
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
40915
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -c
1919321
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -c
2172839

*The output is all over the place. It looks as though the processes stop after consuming a certain amount of input.*

$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | tail
        From worker 8:  Process 8 is processing line 46617
        From worker 5:  Process 5 is processing line 46614
        From worker 2:  Process 2 is processing line 50751
        From worker 4:  Process 4 is processing line 45593
        From worker 11: Process 11 is processing line 45380
        From worker 6:  Process 6 is processing line 46685
        From worker 7:  Process 7 is processing line 50756
        From worker 9:  Process 9 is processing line 46688
        From worker 10: Process 10 is processing line 46699
        From worker 3:  Process 3 is processing line 46692

Now, I could buy that STDOUT is getting clobbered by multiple parallel 
writes to it. I am used to STDOUT carrying garbled/interleaved data in 
other environments/languages, but I have never seen data go missing: the 
characters always make it to the output in some form.
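For what it's worth, the behavior I am used to is easy to reproduce outside Julia. A minimal sketch (Python, purely for illustration; the function name is mine): ten forked processes share a single pipe, and each line is sent with one write() call well under PIPE_BUF, which POSIX guarantees is atomic. The lines interleave in arbitrary order, but no bytes are lost:

```python
# POSIX-only sketch: several forked processes write short lines to one
# shared pipe.  Each write() is below PIPE_BUF, so it is atomic: whole
# lines interleave, but none go missing.
import os

def count_lines_from_parallel_writers(nworkers=10, nlines=1000):
    r, w = os.pipe()
    pids = []
    for wid in range(nworkers):
        pid = os.fork()
        if pid == 0:                       # child: write its lines and exit
            os.close(r)
            for i in range(nlines):
                msg = "worker %d line %d\n" % (wid, i)
                os.write(w, msg.encode())  # one atomic write per line
            os._exit(0)
        pids.append(pid)
    os.close(w)                            # parent keeps only the read end
    data = bytearray()
    while True:
        chunk = os.read(r, 65536)
        if not chunk:                      # EOF once all writers have exited
            break
        data.extend(chunk)
    os.close(r)
    for pid in pids:
        os.waitpid(pid, 0)
    return data.count(b"\n")               # expect nworkers * nlines
```

Every run of this returns the full nworkers * nlines count, which is exactly the "garbled but complete" behavior I am describing.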

But if I redirect the output to a file, it is perfectly fine every single 
time. Why is it that STDOUT does not get clobbered in that case?

$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx
$ wc -l xx
250001 xx
$ wc -c xx
12988916 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx


*== The code below is the same as at the GitHub link above ==*
$ cat ./julia_test_parallel.jl
#!/usr/local/julia-cb9bcae93a/bin/julia
function processOneFile(filename)

    np = nprocs()
    jump = np - 1
    jump = jump == 0 ? 1 : jump

    selfid = myid()

    # In a single-process setup, this function will be called on the parent (id=1)
    assert(jump == 1 || selfid != 1)

    f = open(filename);
    offset = np == 1 ? selfid : selfid - 1
    lnum = 0
    for l in eachline(f)
        lnum += 1
        if lnum == offset
            println("Process $(selfid) is processing line $(lnum)")
            offset += jump
        end
    end
end

$ cat ./julia_test_parallel_driver.jl
#!/usr/local/julia-cb9bcae93a/bin/julia
filename = "nyc311calls.csv"
np = nprocs()
println("Started $(np) processes")
if (np > 1)
    if (myid() == 1)
        # Multiprocess, and I am the parent
        @sync begin
            for i = 2:nprocs()
                @async remotecall_wait(i, processOneFile, filename)
            end
        end
    end
else
    processOneFile(filename)
end

*Any help is appreciated.*

Thanks
--
Harry
