Hi every one. I´m trying to parallelize a process that runs over the rows of a matrix. I expect for each element of the row, which is a species, that the process extracts and writes a file (raster) corresponding to the distribution of each species over their habitats.
Habitas layer is a raster file and each species distribution is a polygon (or group of polygons) from a shapefile. I first convert the species polygon(s) to a raster, second extract the habitats for the species (stored in a matrix where the species habitats codes are matched with the habitats raster values), and third intersect (multiply) the distribution and the habitats. In addition, I would like to produce a richness (number of species map) file (raster). Then, I add (sum) to an empty raster (with values of zero) each final species distribution. I wrote the following function: extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover){ #Libraries library(rgdal) library(raster) #raster file with zeros richness_cur=raster("richness_current.tif") #Selection of species polygons rows=as.numeric(which(as.character(spp_polygons@data$binomial)== as.character(spp_data$binomial[k]))) spp_poly=spp_polygons[rows,] #Covert polygon(s) to raster spp_poly=rasterize(spp_poly,cover,1,background=0) #Match species habitats codes with habitats raster values habs=as.character(spp_data$hab_code[k]) habs=unlist(strsplit(habs, split=sep))#habitat codes are separeted by a ";" cov_classes=as.numeric(as.character(habitat_codes[,2]#Get the hab [which(as.character(habitat_codes[,1])%in%habs)])) #Intersect species distributions with habitats cov_mask=spp_poly*cover #Extract species habitats cov_mask=Which(cov_mask%in%cov_classes) writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep="")) #Sum species richness richness_cur=richness_cur+cov_mask return (richness_cur)} I try to parallelize the process using clusterApply and foreach functions. However, I couldn't return a raster object from the function (which is something obvious to get in a regular loop function) in any of both functions to add to that object the sum of the species richness. So, here is my first question. *1. Does anyone know how to return an object different from a list (I could save each raster inside a single list but no sum each one of them to the previos one), matrix, or vector in a parallelization process?* I solve this problem writing the richness file in each "iteration". Nevertheless, this option causes the process to be slower, so for me, it is not the ideal. Then, the function was rewritten as follows: extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover){ #Libraries library(rgdal) library(raster) #raster file with zeros richness_cur=raster("richness_current.tif") #Selection of species polygons rows=as.numeric(which(as.character(spp_polygons@data$binomial)== as.character(spp_data$binomial[k]))) spp_poly=spp_polygons[rows,] #Covert polygon(s) to raster spp_poly=rasterize(spp_poly,cover,1,background=0) #Match species habitats codes with habitats raster values habs=as.character(spp_data$hab_code[k]) habs=unlist(strsplit(habs, split=sep))#habitat codes are separeted by a ";" cov_classes=as.numeric(as.character(habitat_codes[,2]#Get the hab [which(as.character(habitat_codes[,1])%in%habs)])) #Intersect species distributions with habitats cov_mask=spp_poly*cover #Extract species habitats cov_mask=Which(cov_mask%in%cov_classes) writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep="")) #Sum species richness richness_cur=richness_cur+cov_mask writeRaster(richness_cur,"richness_current.tif")} Complete code to run parallelization was: #Number of cores no_cores=detectCores()-1#Initiate cluster cl=makeCluster(no_cores,type="PSOCK") registerDoParallel(cl) #Table with name and habitat information (columns) for each species (rows) spp_data=read.xlsx2("species_file.xls",sheetIndex=1)#Shape file with species distributions as polygons spp_polygons=readOGR("distributions.shp")#Separation symbol for species habitats stored in spp_data sep=";"#Tabla joining habitas species codes with habitats raster habitat_codes=read.xlsx2("spp_habitats_final.xls",sheetIndex=1)#Habitats raster cover=raster("Z:/Data/cover_2015_proj_fixed_reclas_1km.tif") #Paralelization foreach(k=1:nrow(spp_data)) %dopar% extract_habitats(k=k, spp_data=spp_data, spp_polygons=spp_polygons,sep=sep, habitat_codes=habitat_codes, cover=cover) stopImplicitCluster() stopCluster(cl) The parallelization process runs; however, it didn't works how I expected, since it is not using all the cores: Image of processors working <https://i.stack.imgur.com/m4h5y.jpg>. So, what the parallelization process does, is to start 39 (the number of cores) processes: Image of processes opened <https://i.stack.imgur.com/nDkX4.jpg>, but it does't write one by one the files, how I could expect in a regular loop. It suddenly write blocks of 39 files (something that I can understand), but taking a lot of time (because it seems to work in few cores), even more than if I ran a regular loop (running the regular loop each file is written each two or three minutes, while the block of 39 files is written approximately each one hour). *So, here is my second group of questions. 2. What I´m doing bad? 3. Why it is not using all the 39 processors, or if it uses them, why it doesn't use them at the maximum level? 4. Why it not begins another task (run an other process, e.g., the process 40) when it finishes one (I suppose it because it always write the files in blocks of 39)?* Thanks in advance for your help. Cheers, Jaime -- Jaime Burbano Girón Estudiante Doctoral Doctorado en Estudios Ambientales y Rurales Pontificia Universidad Javeriana Bogotá, Colombia (57-1) 3208320 Ext. 4844 [[alternative HTML version deleted]] _______________________________________________ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology