Hi All,

The current situation is: I have very high-resolution gridded data in NetCDF files (three files per year, for tmin, tmax, and prcp) covering 32 years. There is no interpolation or downscaling involved; I am simply extracting the weather information for each grid point and writing it to a fixed-width text file. The process is both I/O- and memory-intensive: one file usually takes several minutes, and there are hundreds of thousands of files. If I divide the points into 5 parts and launch 5 R sessions simultaneously it runs faster, and with 10 parts and 10 sessions it is faster still, so it should be possible to parallelize this. I will try the methods suggested here and post back if I have a successful experience.

Many thanks for the prompt replies.

Regards,

Ping
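For reference, here is what "firing N R sessions" can look like from inside a single session, using the foreach/doParallel combination Jonathan recommends below. This is only a sketch: it assumes the weather4Point() function and state list from the original post further down, and, per Jonathan's second point, disk I/O may remain the bottleneck no matter how many workers run.

library(foreach)
library(doParallel)

cl <- makeCluster(4)   # roughly one worker per physical core
registerDoParallel(cl)

States <- c("AL", "AZ", "AR", "CO", "CT", "DC", "FL", "GA", "ID", "IA",
            "IN", "KS", "KY", "LA", "MA", "ME", "MI", "MN", "MT", "NC")

# each worker loads the packages weather4Point() needs, then takes
# states off the list one at a time
foreach(state = States, .packages = c("rgdal", "raster", "gdata")) %dopar% {
  weather4Point(state)
}

stopCluster(cl)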
On Mon, Feb 3, 2014 at 10:46 AM, Jonathan Greenberg <[email protected]> wrote:

> Ping:
>
> First: for anyone getting started with parallel programming in R, you
> should really read:
> http://trg.apbionet.org/euasiagrid/docs/parallelR.notes.pdf
>
> Note that wherever it says "snow" you should swap in the package
> "parallel". I'm a big proponent of the "foreach" package, which
> generalizes parallel processing across different parallel
> infrastructures and more closely follows "for"-type looping (*apply
> statements are, in general, much more difficult to understand).
>
> Second: if you are running high-I/O, low-CPU processes (lots of
> reads/writes), parallel processing in general doesn't help unless you
> are on a RAIDed or a parallel filesystem.
>
> Third, as Julian points out: it is easier to start with what you are
> trying to do, rather than "fix my script". What do you mean by:
>
> "The purpose is to generate weather information (tmin, tmax, prcp) for
> every 4 km (every 4 km there is a point representing that grid) for 32
> years throughout the contiguous U.S."
>
> ?
>
> Are you interpolating data? Downscaling it? Extracting statistics
> per state? Summarizing it? All of these have very different
> interpretations, and different approaches (not all of which involve
> parallel processing).
>
> --j
>
> On Mon, Feb 3, 2014 at 8:58 AM, Julian M. Burgos <[email protected]> wrote:
> > Yes, but why are you querying each point individually (one by one) in a
> > loop, and why are you writing every single data point to a different
> > file? Loops are very slow in R, and reading from/writing to a file on
> > disk is also a very slow process. Do that several hundred thousand
> > times and it is no surprise that it takes forever. I do not have time
> > to examine your code either (next time provide a minimal reproducible
> > example, not a long script), but it seems that you have some netCDF
> > files with environmental data and a shapefile with locations, and you
> > want to get the environmental data for each location. If this is what
> > you want to do, you should (a sketch follows below):
> >
> > a) Read the shapefile into a SpatialPoints object.
> > b) Use the raster package and read the netCDF rasters into a
> >    RasterBrick object.
> > c) Use the over function and do an overlay between the points and the
> >    RasterBrick. You will get a SpatialPointsDataFrame object, with the
> >    data associated with each point.
> > d) Save the data into a single file.
> >
> > This should take some time, say 20-30 minutes perhaps, maybe much
> > less, depending on your machine's speed.
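In code, Julian's a)-d) recipe might look like the minimal sketch below. One caveat: sp's over() does not operate on RasterBrick objects directly; raster::extract() performs the equivalent overlay (and with sp = TRUE it returns the SpatialPointsDataFrame Julian describes). The file and layer names are assumptions carried over from the original post.

library(rgdal)   # readOGR
library(raster)  # brick, extract

pl <- readOGR(".", "US_2_points")                  # a) points from the shapefile
b_tmin <- brick("tmin_1981.nc", varname = "tmin")  # b) one year's netCDF as a brick

# c) one extract() call overlays all points on all layers (days) at once;
#    the result is a matrix with one row per point and one column per day
tmin_all <- extract(b_tmin, pl)

# d) save everything into a single file instead of one file per point
write.csv(cbind(coordinates(pl), tmin_all), "tmin_1981_points.csv",
          row.names = FALSE)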
> >
> > Julian
> >
> > ping yang writes:
> >
> >> Hi,
> >>
> >> The purpose is to generate weather information (tmin, tmax, prcp) for
> >> every 4 km (every 4 km there is a point representing that grid) for
> >> 32 years throughout the contiguous U.S. I want to run several states
> >> simultaneously, i.e. run them in parallel.
> >>
> >> Thanks,
> >>
> >> Ping
> >>
> >> On Sun, Feb 2, 2014 at 1:45 PM, Julian Burgos <[email protected]> wrote:
> >>
> >>> A clear explanation of what you are trying to do would help. I cannot
> >>> figure out why you need to generate such a huge number of files!
> >>> Writing data to disk is a very slow step.
> >>>
> >>> > Hi r-sig-geo,
> >>> >
> >>> > I am using the following script to extract weather information for
> >>> > each point (from a rasterized shapefile for the CONUS), which
> >>> > generates about 460,000 files.
> >>> >
> >>> > Originally I read the coordinates from a shapefile:
> >>> >
> >>> > library(rgdal)   # also needs library(raster) and library(gdata)
> >>> >
> >>> > pl <- readOGR(".", "US_2_points")
> >>> >
> >>> > Then I used the following code to extract the information from the
> >>> > three NetCDF files that contain tmin, tmax, and ppt for each year:
> >>> >
> >>> > for (year in 1981:2012)
> >>> > {
> >>> >   # open the three netCDF files for this year as RasterBricks
> >>> >   tminfile <- paste("tmin", "_", year, ".nc", sep = '')
> >>> >   b_tmin <- brick(tminfile, varname = 'tmin')
> >>> >   pptfile <- paste("ppt", "_", year, ".nc", sep = '')
> >>> >   b_ppt <- brick(pptfile, varname = 'ppt')
> >>> >   tmaxfile <- paste("tmax", "_", year, ".nc", sep = '')
> >>> >   b_tmax <- brick(tmaxfile, varname = 'tmax')
> >>> >
> >>> >   print(paste("processing year :", year, sep = ''))
> >>> >
> >>> >   for (l in 1:length(pl))
> >>> >   {
> >>> >     # file name convention N(latitude)W(longitude).wth; 5 digits
> >>> >     # after the decimal point should work
> >>> >     filename <- paste("e:/PRISM/US3/N",
> >>> >                       round(coordinates(pl[l, ])[2], 5), "W",
> >>> >                       abs(round(coordinates(pl[l, ])[1], 5)),
> >>> >                       ".wth", sep = '')
> >>> >     print(paste("processing file :", filename, sep = ''))
> >>> >
> >>> >     # extract one year of daily values for this single point
> >>> >     tmin <- as.numeric(round(extract(b_tmin, coordinates(pl[l, ])), digits = 1))
> >>> >     tmax <- as.numeric(round(extract(b_tmax, coordinates(pl[l, ])), digits = 1))
> >>> >     ppt  <- as.numeric(round(extract(b_ppt,  coordinates(pl[l, ])), digits = 2))
> >>> >
> >>> >     # name the columns directly so the labels cannot get out of
> >>> >     # step with the values
> >>> >     v <- data.frame(tmin = tmin, tmax = tmax, ppt = ppt)
> >>> >     v["default"] <- 0
> >>> >     v["year"] <- year
> >>> >     date <- seq(as.Date(paste(year, "/1/1", sep = '')),
> >>> >                 as.Date(paste(year, "/12/31", sep = '')), "days")
> >>> >     v["month"] <- as.numeric(substr(date, 6, 7))
> >>> >     v["day"] <- as.numeric(substr(date, 9, 10))
> >>> >     v <- v[c("year", "month", "day", "default", "tmin", "tmax", "ppt")]
> >>> >
> >>> >     # write to the file in fixed-width format
> >>> >     if (file.exists(filename))
> >>> >     {
> >>> >       write.fwf(x = v, filename, append = TRUE, na = "NA",
> >>> >                 rownames = FALSE, colnames = FALSE,
> >>> >                 width = c(6, 3, 3, 5, 5, 5, 6))
> >>> >     }
> >>> >     else
> >>> >     {
> >>> >       write.fwf(x = v, filename, append = FALSE, na = "NA",
> >>> >                 rownames = FALSE, colnames = FALSE,
> >>> >                 width = c(6, 3, 3, 5, 5, 5, 6))
> >>> >     }
> >>> >   }
> >>> > }
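A small aside on the tail of that loop: write.fwf() comes from the gdata package and takes a logical append argument, so the file.exists() test can feed it directly and the if/else collapses to a single, behaviour-preserving call:

write.fwf(x = v, filename, append = file.exists(filename), na = "NA",
          rownames = FALSE, colnames = FALSE, width = c(6, 3, 3, 5, 5, 5, 6))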
> >>> >
> >>> > I found it would take weeks to finish, so I divided the points from
> >>> > the shapefile by state (generating a separate point shapefile for
> >>> > each state) and used the following function:
> >>> >
> >>> > weather4Point <- function(state)
> >>> > {
> >>> >   # get the point file for the state
> >>> >   setwd("c:/PRISM/US1Points/")
> >>> >   pl <- readOGR(".", paste("points4_", state, sep = ''))
> >>> >   setwd("e:/PRISM/NetCDF/")
> >>> >
> >>> >   for (year in 1981:2012)
> >>> >   {
> >>> >     # ... identical to the year/point loop in the script above:
> >>> >     # open the three bricks, extract tmin/tmax/ppt for each point
> >>> >     # in pl, assemble v, and write.fwf() it to the per-point .wth
> >>> >     # file ...
> >>> >   }
> >>> > }
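Independent of any parallelization, weather4Point() still calls extract() three times per point per year. raster::extract() also accepts the whole SpatialPoints object in one call and returns a matrix with one row per point and one column per day, so three calls per year cover every point in the state; the per-point files are then written from the matrix rows. A sketch of that inner replacement, with year/month/day built as in the original loop and a hypothetical make_filename() helper standing in for the long paste() expression:

# inside the year loop, replacing the per-point extract() calls
tmin_m <- round(extract(b_tmin, pl), 1)   # rows = points, cols = days
tmax_m <- round(extract(b_tmax, pl), 1)
ppt_m  <- round(extract(b_ppt,  pl), 2)

for (l in seq_len(nrow(tmin_m))) {
  v <- data.frame(year = year, month = month, day = day, default = 0,
                  tmin = tmin_m[l, ], tmax = tmax_m[l, ], ppt = ppt_m[l, ])
  f <- make_filename(pl[l, ])   # hypothetical: builds the N...W... .wth path
  write.fwf(v, f, append = file.exists(f), na = "NA",
            rownames = FALSE, colnames = FALSE, width = c(6, 3, 3, 5, 5, 5, 6))
}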
> >>> >
> >>> > Then, to make this process "parallel", I opened several R sessions
> >>> > and ran a chunk of states in each:
> >>> >
> >>> > States <- c("AL", "AZ", "AR", "CO", "CT", "DC", "FL", "GA", "ID", "IA",
> >>> >             "IN", "KS", "KY", "LA", "MA", "ME", "MI", "MN", "MT", "NC")
> >>> >
> >>> > for (state in States)
> >>> > {
> >>> >   weather4Point(state)
> >>> > }
> >>> >
> >>> > However, this still feels slow when I plan to cover the whole U.S.
> >>> > What I am asking is whether there is a better solution to speed up
> >>> > this process (any possibility of employing parallel programming in
> >>> > R)?
> >>> >
> >>> > Environment information:
> >>> >
> >>> > R version 3.0.1 (2013-05-16)
> >>> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> >>> > Memory: 16 GB
> >>> > Processor: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
> >>> >
> >>> > Looking forward to any suggestions.
> >>> >
> >>> > Regards,
> >>> >
> >>> > Ping
> >
> > --
> > Julian Mariano Burgos, PhD
> > Hafrannsóknastofnun / Marine Research Institute
> > Skúlagata 4, 121 Reykjavík, Iceland
> > Sími/Telephone: +354-5752037
> > Bréfsími/Telefax: +354-5752001
> > Netfang/Email: [email protected]
>
> --
> Jonathan A. Greenberg, PhD
> Assistant Professor
> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
> Department of Geography and Geographic Information Science
> University of Illinois at Urbana-Champaign
> 259 Computing Applications Building, MC-150
> 605 East Springfield Avenue
> Champaign, IL 61820-6371
> Phone: 217-300-1924
> http://www.geog.illinois.edu/~jgrn/
> AIM: jgrn307, MSN: [email protected], Gchat: jgrn307, Skype: jgrn3007
_______________________________________________
R-sig-Geo mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
