Moritz Lennert wrote:

> >> I don't know how to debug this...
> >
> > Can you identify a repeatable test case?
> >
> > If I could make it happen, I could debug it.
>
> You can get a location names TEST here:
>
> http://tomahawk.ulb.ac.be/moritz/mask_bug_testlocation.tgz
>
> This contains only a PERMANENT mapset.
>
> In that mapset, launch the following command:
>
> r.mask vect=hull; for map in $(g.list rast pat="firm_rate*"); do echo
> $map ; r.mapcalc "temp_prob = float($map) / sum_rates" --o --q; done;
> r.mask -r
>
> I get the error arbitrarily for different firm_rate_* maps, sometimes
> only for one, sometimes for many, but at each run its for different
> maps.

So it's non-deterministic (I'm getting one error for every 10-20
passes over the data, i.e. every 1200-2500 commands), and only applies
to r.mapcalc.

My first guess was a race condition related to pthreads. I tried
"export WORKERS=0" before running the test, and it hasn't happened
since.

And actually I'm now fairly certain as to the specific cause.

When compiled with pthread support, r.mapcalc has a mutex for each map
to prevent concurrent access to a single map from multiple threads.
Concurrent access to different maps (and to core lib/gis and
lib/raster functionality) from different threads is supposed to be
safe (see r34485 and the interval surrounding it), but the MASK was
overlooked.

If a MASK is in use, reading a row from any raster map will also read
the corresponding row from the MASK, and there's nothing to prevent
different threads from concurrently accessing two different maps and
thus concurrently accessing the MASK.

So, in read_data_{compressed,uncompressed,fp_compressed} in
lib/raster/get_row.c we have code like:

    if (lseek(fcb->data_fd, (off_t) row * bufsize, SEEK_SET) == -1)
        G_fatal_error(_("Error reading raster data for row %d of <%s>"),
                      row, fcb->name);

    if (read(fcb->data_fd, data_buf, bufsize) != bufsize)
        G_fatal_error(_("Error reading raster data for row %d of <%s>"),
                      row, fcb->name);

If multiple threads execute this code concurrently on the same file
descriptor, you can end up with the calls being interleaved like so:

    Thread 1        Thread 2
    lseek
                    lseek
    read
                    read

meaning that the file offset has changed between the lseek() and the
read() (this is why X/Open and POSIX added pread(), but that's still
relatively new).

This only results in an error at the end of the file (the first read()
will leave the file offset at EOF, so the second read() fails), but in
other situations it's likely causing the wrong row of the MASK to be
read.
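
To make the interleaving above concrete, here is a minimal standalone sketch. It is not GRASS code: the temporary file, ROW_SIZE, NUM_ROWS and all function names are made up for the illustration. It replays the lseek/lseek/read/read ordering in a single thread on a small file, and shows that a pread()-based read takes the offset as an argument and so does not depend on the shared file offset.

```c
#define _XOPEN_SOURCE 700       /* for pread() and mkstemp() */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ROW_SIZE 4              /* hypothetical row length in bytes */
#define NUM_ROWS 3              /* hypothetical number of rows */

/* Create an unlinked temporary file holding the rows
 * "AAAA", "BBBB", "CCCC". Returns the descriptor, or -1 on error. */
static int make_test_file(void)
{
    char path[] = "/tmp/mask_race_XXXXXX";
    char rowbuf[ROW_SIZE];
    int fd = mkstemp(path);
    int row;

    if (fd == -1)
        return -1;
    unlink(path);               /* file disappears once fd is closed */

    for (row = 0; row < NUM_ROWS; row++) {
        memset(rowbuf, 'A' + row, ROW_SIZE);
        if (write(fd, rowbuf, ROW_SIZE) != ROW_SIZE) {
            close(fd);
            return -1;
        }
    }
    return fd;
}

/* Replay the problematic ordering in one thread: "thread 1" wants
 * row 0, "thread 2" wants the last row, and both seeks happen before
 * either read. Returns what thread 2's read() returned, or -1 if an
 * earlier call failed. */
static ssize_t interleaved_reads(int fd, char *t1_buf, char *t2_buf)
{
    if (lseek(fd, (off_t) 0, SEEK_SET) == -1)                        /* thread 1 */
        return -1;
    if (lseek(fd, (off_t) (NUM_ROWS - 1) * ROW_SIZE, SEEK_SET) == -1) /* thread 2 */
        return -1;
    if (read(fd, t1_buf, ROW_SIZE) != ROW_SIZE)                      /* thread 1 */
        return -1;
    return read(fd, t2_buf, ROW_SIZE);                               /* thread 2 */
}

/* Read one row with pread(): the offset is passed explicitly and the
 * descriptor's file offset is neither used nor modified.
 * Returns 0 on success, -1 on error or short read. */
static int read_row_pread(int fd, int row, char *buf)
{
    ssize_t n = pread(fd, buf, ROW_SIZE, (off_t) row * ROW_SIZE);

    return n == ROW_SIZE ? 0 : -1;
}
```

With the file above, interleaved_reads() leaves the last row in thread 1's buffer (it asked for row 0) and thread 2's read() returns 0 because the offset is already at EOF, which corresponds to the "Error reading raster data" case. read_row_pread() returns the requested row regardless of where the file offset currently is.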

A possible quick fix:

    if (R__.auto_mask > 0)
        putenv("WORKERS=0");

A slightly better fix would be to check for masking and if it's
enabled, have a single mutex which guards *all* raster reads so that
even concurrent access to different maps is blocked. Unlike the above
hack, this still allows computations to be executed in parallel.

Better still would be to guard access to the MASK so that the other
aspects of raster input can be parallelised (raster I/O is still a
major bottleneck, and mostly because of processing rather than actual
disc access). But that would involve either adding pthread code
directly into the base raster input code in lib/raster/get_row.c
(undesirable) or at least adding a mechanism to allow r.mapcalc to
hook into it to provide the mutex.

-- 
Glynn Clements <gl...@gclements.plus.com>
_______________________________________________
grass-dev mailing list
grass-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-dev