Using different blocks should be fine. Unlike with DArrays, all processes have access to all of the memory. I believe localindexes etc. are just for convenience. Ideally, you would put the results of parsing the csv file directly into a SharedArray, but I'm not sure how easy that would be.
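As a rough illustration of why custom blocks should work, here is a minimal sketch in which each worker reads a line-aligned range of the same shared buffer. It assumes a current Julia 1.x with the Distributed and SharedArrays standard libraries, which are newer than the API used in this thread. The helpers `line_blocks` and `count_newlines`, the worker count, and the tiny in-memory sample are all hypothetical and only there for illustration.

```julia
using Distributed
addprocs(2)                        # hypothetical worker count, illustration only
@everywhere using SharedArrays

# Hypothetical helper: count the newline bytes inside one block of the buffer.
@everywhere count_newlines(v, r) = count(==(0x0a), view(v, r))

# Hypothetical helper: split 1:length(v) into at most np ranges ending on '\n'.
function line_blocks(v::AbstractVector{UInt8}, np::Integer)
    len = length(v)
    bsz = div(len, np)
    blks = UnitRange{Int}[]
    low = 1
    for i in 1:np-1
        eol = findnext(==(0x0a), v, max(low, i * bsz))
        eol === nothing && break   # no further newline: stop splitting
        push!(blks, low:eol)
        low = eol + 1
    end
    low <= len && push!(blks, low:len)
    return blks
end

# Small in-memory stand-in for the bytes of a csv file.
bytes = Vector{UInt8}("a,b\n1,2\n3,4\n5,6\n")
sm = SharedArray{UInt8}(length(bytes))
copyto!(sm, bytes)

blks = line_blocks(sm, nworkers())

# Each worker indexes its own line-aligned block of the same shared memory.
futures = [remotecall(count_newlines, p, sm, r) for (p, r) in zip(workers(), blks)]
counts = fetch.(futures)
```

Nothing in this sketch relies on the default per-process partition of the array; the ranges are simply passed to the workers as arguments.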
Using mmap_array ought to be faster, though. If you mmap the same file on all the workers, they're all supposed to share the same cache. To do this I believe you have to call mmap on each worker; I'm pretty sure that anything that serializes an mmapped array will copy it, which is not what you want.

If you plan to read the same file many times, you might also consider reading it once and then saving it as a JLD file. This should be more efficient, because binary data is smaller and doesn't need to be parsed. You can use the mmaparrays flag or the hyperslab interface if you only need to access chunks of the data at a time. Again, if you want to access chunks of the file from several workers on the same system, the best strategy is probably to read the file using mmaparrays on each worker.

Simon

On Friday, May 2, 2014 5:53:09 PM UTC-4, Douglas Bates wrote:
>
> In another thread I mentioned looking at some largish data sets (10's of
> gigabytes in the form of a .csv file). I have made some progress treating
> the file as a memory-mapped Uint8 array but that hasn't been as effective
> as I would have hoped. Using a shared array and multiple processes seems
> an effective way to parallelize the initial reduction of the .csv file.
>
> The best way I have come up with of getting a file's contents as a shared
> array is
>
> sm = convert(SharedArray, open(readbytes,"./kaggle/trainHistory.csv"))
>
> It would be convenient to process the contents on line boundaries. I can
> determine suitable ranges with something like
>
> function blocks(v::SharedVector{Uint8})
>     np = length(v.pids)
>     len = length(v)
>     bsz = div(len,np)
>     blks = Array(UnitRange{Int},np)
>     low = 1
>     for i in 1:np-1
>         eolpos = findnext(v, '\n', i*bsz)
>         blks[i] = UnitRange(low, eolpos)
>         low = eolpos + 1
>     end
>     blks[np] = UnitRange(low,len)
>     blks
> end
>
> which in this case produces
>
> julia> blocks(sm)
> 8-element Array{UnitRange{Int64},1}:
>  1:794390
>  794391:1588775
>  1588776:2383151
>  2383152:3177538
>  3177539:3971942
>  3971943:4766322
>  4766323:5560686
>  5560687:6355060
>
> (This is a smaller file that I am using for testing. The real files are
> much larger.)
>
> These blocks will be different from what I would get with
> sm.loc_subarray_1d. It seems to me that I should be able to use these
> blocks rather than the .loc_subarray_1d blocks if I do enough juggling with
> @spawnat, fetch, etc. Is there anything that would stand in the way of
> doing so?
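For the suggestion of calling mmap on each worker, a minimal sketch might look like the following. It assumes a current Julia 1.x, where memory mapping is provided by `Mmap.mmap` from the Mmap standard library rather than the older `mmap_array`. The helper `count_newlines_mmap`, the worker count, the temporary file, and the hand-picked ranges are hypothetical and for illustration only.

```julia
using Distributed
addprocs(2)                        # hypothetical worker count, illustration only
@everywhere using Mmap

# Hypothetical helper: map the file on whichever process runs this function,
# then count the newline bytes inside the given range. Only the path and the
# range are sent to the worker, so the mapped array is never serialized.
@everywhere function count_newlines_mmap(path, r)
    v = Mmap.mmap(path)            # maps the whole file as a Vector{UInt8}
    return count(==(0x0a), view(v, r))
end

# Small temporary file standing in for a large csv file.
path, io = mktemp()
write(io, "a,b\n1,2\n3,4\n5,6\n")
close(io)

# Line-aligned ranges, chosen by hand for this tiny sample.
blks = [1:8, 9:16]

# Start one task per worker, then collect the results.
futures = [remotecall(count_newlines_mmap, p, path, r) for (p, r) in zip(workers(), blks)]
counts = fetch.(futures)
```

Because every worker maps the file itself, workers on the same machine should end up reading from the same operating system page cache instead of each receiving a copy of the data.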