Using different blocks should be fine. Unlike with DArrays, all processes 
have access to all of the memory; I believe localindexes etc. are just a 
convenience. Ideally, you would put the results of parsing the csv file 
directly into a SharedArray, but I'm not sure how easy that would be.
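
For what it's worth, here is a rough sketch of the @spawnat / fetch pattern 
over hand-chosen blocks, in 0.3 syntax. It assumes workers have already been 
added with addprocs and uses the blocks(sm) helper from your message below; 
the per-block work is just counting newlines, as a stand-in for parsing that 
would write its results into a preallocated SharedArray:

sm   = convert(SharedArray, open(readbytes, "./kaggle/trainHistory.csv"))
blks = blocks(sm)

@everywhere function countnl(v::SharedVector{Uint8}, r::UnitRange{Int})
    n = 0
    for i in r
        v[i] == uint8('\n') && (n += 1)   # stand-in for the real per-line parsing
    end
    n
end

refs  = [@spawnat p countnl(sm, blks[i]) for (i, p) in enumerate(sm.pids)]
total = sum(map(fetch, refs))             # fetch waits for each worker and surfaces errors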

Using mmap_array ought to be faster, though. If you mmap the same file on 
all the workers, they should all share the same OS page cache. To do this I 
believe you have to call mmap_array on each worker; I'm pretty sure that anything 
that serializes an mmapped array will copy it, which is not what you want.
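
Something like the following sketch is what I have in mind (again 0.3 syntax, 
assuming workers were already added with addprocs and that the file is visible 
at the same path on every process; the even byte blocks here ignore line 
boundaries, which you would fix as in your blocks() function below). Only the 
file name and a byte range are sent to each worker, never the mapped array:

fn   = "./kaggle/trainHistory.csv"
len  = filesize(fn)
bsz  = div(len, nworkers())
blks = [((i-1)*bsz + 1):(i == nworkers() ? len : i*bsz) for i in 1:nworkers()]

@everywhere function commas_in_range(fn, r::UnitRange{Int})
    v = mmap_array(Uint8, (filesize(fn),), open(fn, "r"))  # map locally; don't ship the array
    n = 0
    for i in r
        v[i] == uint8(',') && (n += 1)                     # stand-in for the real parsing work
    end
    n
end

refs = [@spawnat p commas_in_range(fn, blks[i]) for (i, p) in enumerate(workers())]
map(fetch, refs)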

If you plan to read the same file many times, you might also consider 
reading it once and then saving it as a JLD file. This should be more 
efficient, because binary data is smaller and doesn't need to be parsed. 
You can use the mmaparrays flag or the hyperslab interface if you only need 
to access chunks of the data at a time. Again, if you want to access chunks 
of the file from several workers on the same system, the best strategy is 
probably to open the file with mmaparrays on each worker.
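
A rough sketch of that route, assuming the HDF5 and JLD packages are installed 
and that x is the numeric array you get from parsing the csv once:

using HDF5, JLD

save("train.jld", "x", x)         # one-time conversion to a binary file

# later, on whichever process needs the data:
f = jldopen("train.jld", "r", mmaparrays=true)
y = read(f, "x")                  # backed by an mmap rather than read eagerly
# or read a chunk at a time, e.g. something like f["x"][1:100000, :]
close(f)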

Simon

On Friday, May 2, 2014 5:53:09 PM UTC-4, Douglas Bates wrote:
>
> In another thread I mentioned looking at some largish data sets (tens of 
> gigabytes in the form of a .csv file).  I have made some progress treating 
> the file as a memory-mapped Uint8 array but that hasn't been as effective 
> as I would have hoped.  Using a shared array and multiple processes seems 
> an effective way to parallelize the initial reduction of the .csv file.
>
> The best way I have come up with for getting a file's contents into a shared 
> array is
>
> sm = convert(SharedArray, open(readbytes,"./kaggle/trainHistory.csv"))
>
> It would be convenient to process the contents on line boundaries.  I can 
> determine suitable ranges with something like
>
> function blocks(v::SharedVector{Uint8})
>     np  = length(v.pids)               # number of participating processes
>     len = length(v)
>     bsz = div(len, np)                 # nominal bytes per block
>     blks = Array(UnitRange{Int}, np)
>     low = 1
>     for i in 1:np-1
>         # end each block at the first newline at or after the nominal boundary
>         eolpos = findnext(v, '\n', i*bsz)
>         blks[i] = UnitRange(low, eolpos)
>         low = eolpos + 1
>     end
>     blks[np] = UnitRange(low, len)     # the last block takes whatever remains
>     blks
> end
>
> which in this case produces
>
> julia> blocks(sm)
> 8-element Array{UnitRange{Int64},1}:
>  1:794390       
>  794391:1588775 
>  1588776:2383151
>  2383152:3177538
>  3177539:3971942
>  3971943:4766322
>  4766323:5560686
>  5560687:6355060
>
>
> (This is a smaller file that I am using for testing.  The real files are 
> much larger.)
>
> These blocks will be different from what I would get with 
> sm.loc_subarray_1d.  It seems to me that I should be able to use these 
> blocks rather than the .loc_subarray_1d blocks if I do enough juggling with 
> @spawnat, fetch, etc.  Is there anything that would stand in the way of 
> doing so?
>
