I'm looking for pointers on the best practice for parallel reading and 
preprocessing several files of variable-length time series data. I have 
little familiarity with parallelization, so it feels like I'm missing 
something obvious. I've read the Julia documentation and tried several 
approaches based on it, but with little success.

All worker pids are on one machine. Each file consists of a fixed-length 
header + NX points of variable-precision data, where NX varies from file to 
file and is stored as a value in each file's fixed-length header. 
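
The header read itself is cheap; here's a sketch of what I mean 
(illustrative layout only: I'm pretending the header is 32 bytes with NX 
stored as an Int32, native byte order, at byte offset 4 -- the real layouts 
differ by format):

```julia
# Illustrative only: assumes a 32-byte header with NX stored as an
# Int32 (native byte order) at byte offset 4.
function read_nx(io::IO)
    seek(io, 4)          # hypothetical byte offset of NX in the header
    Int(read(io, Int32)) # NX: number of data points in this file
end
```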

I've come up with two clumsy solutions; the problems with each are described 
below. Note that, with no parallelization, my execution time is ~400% that 
of numpy, mostly due to file read speed. 

   1. *pmap file read + preprocess*. Somewhat slow due to the overhead of 
   moving data; slightly slower than numpy.
   2. *shared array, parallelize read + downsample*. I can achieve a 20% 
   speedup vs. numpy, but I've only found two ways to make this work.
      - A two-pass approach: open each file twice in a pair of pmap-style 
      @sync loops.
         - First pass, loop over each file: open, get NX, get timestamp, 
         close, return.
         - Then set NY = round(Int, sum(NX)*fs_ratio); xx = 
         SharedArray(Float64, (NY,)); set variables for e.g. time indexing.
         - Second pass, loop over each file: open, seek to data start, read 
         data, downsample into the shared array, close.
      - A single-pass approach with stat():
         - Estimate the shared array size by summing 
         (stat(file).size - header_size)/(smallest_precision) over each file.
         - Initialize a shared array of NaNs, loop over each file to read 
         data, then delete the leftover NaNs.
         - This runs into memory problems: the lowest precision of some 
         data formats overestimates NX by 8x (e.g. Int4 vs. Int32). *
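
For concreteness, here's roughly what my two-pass pattern looks like as a 
sketch (same illustrative layout as above: a 32-byte header with NX as an 
Int32 at byte offset 4, followed by NX Float64 samples; decimation by a 
fixed step stands in for my real downsampling):

```julia
using Distributed, SharedArrays

function two_pass(files::Vector{String}, step::Int)
    # Pass 1: open each file just long enough to read NX from the header
    # (assumed: Int32 at byte offset 4 of a 32-byte header).
    nxs = [open(f) do io
               seek(io, 4)
               Int(read(io, Int32))
           end for f in files]
    counts  = [cld(nx, step) for nx in nxs]   # samples kept per file
    offsets = cumsum([0; counts[1:end-1]])    # slice start per file
    xx = SharedArray{Float64}(sum(counts))

    # Pass 2: each iteration reads one file and decimates into its slice.
    @sync @distributed for i in eachindex(files)
        open(files[i]) do io
            seek(io, 32)   # skip the (assumed) 32-byte header
            data = Vector{Float64}(undef, nxs[i])
            read!(io, data)
            xx[offsets[i]+1 : offsets[i]+counts[i]] = data[1:step:end]
        end
    end
    return xx
end
```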
         
It seems like the fastest approach (by far) would be to leave IOStreams 
open (in the two-pass approach) or to read and preprocess each file into 
its own SharedArray (in the one-pass approach).

   - With the former approach, I don't know how to pass IOStreams between 
   workers. Is this even possible? I've flailed with approaches like an 
   Array{IOStream,1} filled with streams created in my parallel kernel, but 
   I get "Bad file descriptor" errors when I try to read from any of the 
   resulting IOStreams again. For any such stream, isopen(stream) and 
   isreadable(stream) both return true, but the handle is set to 
   @0x00...000. I don't know a workaround, or whether one exists.
   - With the latter, if data are stored in a SharedArray created in my 
   parallel kernel, then accessing it from myid() with e.g. sdata(s) gives 
   #undef for every element of s. What I've read suggests this is by design 
   (the same basic problem as 
   https://github.com/JuliaLang/julia/issues/13802). What am I doing wrong 
   here?
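
For contrast, the pattern I *can* get to work is constructing the 
SharedArray on the master and only filling it from the parallel loop -- a 
minimal sketch of that working case:

```julia
using Distributed, SharedArrays

# A SharedArray constructed on the master process is mapped into the
# master's address space, so it can be filled from a parallel loop and
# read back with sdata() afterwards.
s = SharedArray{Float64}(4)
@sync @distributed for i in 1:4
    s[i] = Float64(i)^2
end
```

(SharedArray{Float64}(4) is the newer constructor spelling; older Julia 
used SharedArray(Float64, (4,)).)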
   

Thanks in advance for any suggestions you can offer.


* Aside, mostly: is there a documented/accepted "fastest" way to read a 
signed two's-complement 4-bit integer? 
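
To illustrate what I mean, here's the naive nibble unpacking I'd write by 
hand (assuming two samples per byte, high nibble first -- the packing order 
may differ by format); I'm asking whether there's something faster or more 
standard:

```julia
# Naive unpacking: two signed 4-bit samples per byte, high nibble first
# (packing order assumed). Nibbles 0x8..0xF sign-extend to -8..-1.
sext4(n::UInt8) = n < 0x08 ? Int8(n) : Int8(n) - Int8(16)

function unpack_int4(bytes::AbstractVector{UInt8})
    out = Vector{Int8}(undef, 2 * length(bytes))
    @inbounds for (i, b) in enumerate(bytes)
        out[2i - 1] = sext4(b >> 4)    # high nibble (assumed first)
        out[2i]     = sext4(b & 0x0f)  # low nibble
    end
    return out
end
```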
