Hi All,

The current situation is:

I have very high resolution gridded data (in NetCDF files; each year has
three NetCDF files, for tmin, tmax and prcp) for 32 years. There is no
interpolation or downscaling process here; what I am doing is extracting
the weather information for each grid point and writing it to a fixed-width
format txt file. It is really I/O- and memory-intensive processing: one
file usually takes several minutes, and there are hundreds of thousands of
files. If I divide all the points into 5 parts and fire up 5 R programs to
run simultaneously it is faster, and if I divide them into 10 parts and
fire up 10 R programs it is faster still, so I think it should be possible
to parallelize this.
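
Something like the following is what I have in mind (a rough sketch only;
process_points is a hypothetical stand-in for my extraction loop, and the
worker count is arbitrary):

  library(parallel)

  # hypothetical worker: runs my extraction loop over one subset of points
  process_points <- function(idx) {
    # ... extract and write the .wth files for points pl[idx, ] ...
    length(idx)   # return something small per chunk
  }

  n_workers <- 8                               # e.g. one per core
  idx <- 1:460000                              # all point indices
  chunks <- split(idx, cut(idx, n_workers))    # divide the points into parts
  cl <- makeCluster(n_workers)                 # PSOCK cluster, works on Windows
  # objects the workers need (e.g. pl) must be shipped with clusterExport()
  res <- parLapply(cl, chunks, process_points)
  stopCluster(cl)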

I will try the methods suggested here and post back if I have a successful
experience.

Many thanks for the prompt replies.

Regards,

Ping


On Mon, Feb 3, 2014 at 10:46 AM, Jonathan Greenberg <[email protected]> wrote:

> Ping:
>
> First: For anyone getting started with parallel programming in R, you
> should really read:
> http://trg.apbionet.org/euasiagrid/docs/parallelR.notes.pdf
>
> Note that wherever it says "snow" you should swap in the package
> "parallel".  I'm a big proponent of the "foreach" package, which
> generalizes parallel processing across different parallel
> infrastructures, and more clearly follows "for"-type looping (*apply
> statements are, in general, much more difficult to understand).
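>
> A minimal foreach pattern looks something like this (a sketch only; the
> worker count and the toy loop body are placeholders for your real
> per-chunk work):
>
>   library(foreach)
>   library(doParallel)
>
>   cl <- makeCluster(4)       # 4 workers
>   registerDoParallel(cl)     # tell foreach to use them
>
>   # %dopar% farms iterations out to the registered workers;
>   # .combine = rbind stacks the per-iteration results
>   out <- foreach(i = 1:10, .combine = rbind) %dopar% {
>     data.frame(i = i, sq = i^2)   # stand-in for the real per-chunk work
>   }
>
>   stopCluster(cl)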
>
> Second: if you are doing high-I/O, low-CPU processing (lots of
> reads/writes), parallel processing generally doesn't help unless you
> are on a RAIDed or a parallel filesystem.
>
> Third, as Julian points out: it is easier to start with what you are
> trying to do, rather than "fix my script".  What do you mean by:
>
> "The purpose is to generate weather information (tmin, tmax, prcp) for
> every 4 km (every 4 km there is a point representing that grid) for 32
> years throughout the Contiguous U.S."
>
> ?
>
> Are you interpolating data?  Downscaling it?  Extracting statistics
> per state?  Summarizing it?  All of these have very different
> interpretations, and different approaches (not all of which involve
> parallel processing).
>
> --j
>
> On Mon, Feb 3, 2014 at 8:58 AM, Julian M. Burgos <[email protected]> wrote:
> > Yes, but why are you querying each point individually (one by one) in a
> > loop, and why are you writing every single data point to a different
> > file?  Loops are very slow in R, and reading/writing from/to a
> > disk file is also a very slow process.  Do that several hundred thousand
> > times and it is no surprise that this takes forever.  I do not have time
> > to examine your code either (next time provide a minimal reproducible
> > example, not a long script), but it seems that you have some netCDF
> > files with environmental data and a shapefile with locations, and you
> > want to get the environmental data for each location.  If this is what
> > you want to do, you should:
> >
> > a) Read the shapefile into a SpatialPoints object.
> > b) Use the raster package and read the netCDF rasters into a rasterbrick
> > object.
> > c) Use the extract function to do an overlay between the points and the
> > rasterbrick.  With sp = TRUE you will get a SpatialPointsDataFrame
> > object, with the data associated with each point.
> > d) Save the data into a single file (a code sketch follows below).
> >
> > This should take some time, say 20-30 minutes, maybe much less,
> > depending on your machine speed.
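> >
> > Roughly, in code (a sketch only, assuming the rgdal and raster packages
> > and a one-year tmin file named as in your script; repeat per variable
> > and year):
> >
> >   library(rgdal)
> >   library(raster)
> >
> >   # a) points from the shapefile
> >   pl <- readOGR(".", "US_2_points")
> >
> >   # b) one rasterbrick per variable (one layer per day)
> >   b_tmin <- brick("tmin_1981.nc", varname = "tmin")
> >
> >   # c) ONE extract call for all points at once, instead of a loop;
> >   #    sp = TRUE appends the extracted values to the points
> >   pts <- extract(b_tmin, pl, sp = TRUE)
> >
> >   # d) everything in a single file
> >   write.csv(as.data.frame(pts), "tmin_1981.csv", row.names = FALSE)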
> >
> > Julian
> >
> >
> > ping yang writes:
> >
> >> Hi,
> >>
> >> The purpose is to generate weather information (tmin, tmax, prcp) for
> >> every 4 km (every 4 km there is a point representing that grid) for 32
> >> years throughout the Contiguous U.S.
> >> I want to run several states simultaneously, which is why I want to run
> >> them in parallel.
> >>
> >> Thanks,
> >>
> >> Ping
> >>
> >>
> >> On Sun, Feb 2, 2014 at 1:45 PM, Julian Burgos <[email protected]> wrote:
> >>
> >>> A clear explanation of what you are trying to do would help.  I cannot
> >>> figure out why you need to generate such a huge number of files!
> >>> Writing data to disk is a very slow step.
> >>>
> >>> > Hi r-sig-geo,
> >>> >
> >>> >
> >>> >
> >>> > I am using the following script to extract weather information for
> >>> > each point (from a rasterized shapefile for the CONUS), which
> >>> > generates about 460,000 files:
> >>> >
> >>> > Originally I read the coordinates from a shapefile:
> >>> >
> >>> > library(rgdal)
> >>> > library(raster)   # for brick() and extract()
> >>> > library(gdata)    # for write.fwf()
> >>> >
> >>> > pl <- readOGR(".", "US_2_points")
> >>> >
> >>> >
> >>> >
> >>> > then I used the following code to extract the information from the
> >>> > three NetCDF files that contain the tmin, tmax and ppt for each year:
> >>> >
> >>> > for (year in 1981:2012)
> >>> > {
> >>> >   # get the three netCDF files for this year
> >>> >   tminfile <- paste("tmin", "_", year, ".nc", sep = '')
> >>> >   b_tmin <- brick(tminfile, varname = 'tmin')
> >>> >   pptfile <- paste("ppt", "_", year, ".nc", sep = '')
> >>> >   b_ppt <- brick(pptfile, varname = 'ppt')
> >>> >   tmaxfile <- paste("tmax", "_", year, ".nc", sep = '')
> >>> >   b_tmax <- brick(tmaxfile, varname = 'tmax')
> >>> >   print(paste("processing year :", year, sep = ''))
> >>> >
> >>> >   for (l in 1:length(pl))
> >>> >   {
> >>> >     # file name convention N(latitude)W(longitude).wth,
> >>> >     # 5 digits after the decimal point
> >>> >     filename <- paste("e:/PRISM/US3/N",
> >>> >       round(coordinates(pl[l, ])[2], 5), "W",
> >>> >       abs(round(coordinates(pl[l, ])[1], 5)), ".wth", sep = '')
> >>> >     print(paste("processing file :", filename, sep = ''))
> >>> >
> >>> >     tmin <- as.numeric(round(extract(b_tmin, coordinates(pl[l, ])), digits = 1))
> >>> >     tmax <- as.numeric(round(extract(b_tmax, coordinates(pl[l, ])), digits = 1))
> >>> >     ppt  <- as.numeric(round(extract(b_ppt, coordinates(pl[l, ])), digits = 2))
> >>> >
> >>> >     v <- data.frame(cbind(tmax, tmin, ppt))
> >>> >     colnames(v) <- c("tmax", "tmin", "ppt")  # names must match the cbind order
> >>> >     v["default"] <- 0
> >>> >     v["year"] <- year
> >>> >     date <- seq(as.Date(paste(year, "/1/1", sep = '')),
> >>> >       as.Date(paste(year, "/12/31", sep = '')), "days")
> >>> >     v["month"] <- as.numeric(substr(date, 6, 7))
> >>> >     v["day"]   <- as.numeric(substr(date, 9, 10))
> >>> >     v <- v[c("year", "month", "day", "default", "tmin", "tmax", "ppt")]
> >>> >
> >>> >     # write into a file with the fixed-width format (append if it exists)
> >>> >     if (file.exists(filename)) {
> >>> >       write.fwf(x = v, filename, append = TRUE, na = "NA",
> >>> >         rownames = FALSE, colnames = FALSE, width = c(6, 3, 3, 5, 5, 5, 6))
> >>> >     } else {
> >>> >       write.fwf(x = v, filename, append = FALSE, na = "NA",
> >>> >         rownames = FALSE, colnames = FALSE, width = c(6, 3, 3, 5, 5, 5, 6))
> >>> >     }
> >>> >   }
> >>> > }
> >>> >
> >>> >
> >>> >
> >>> > I found it would take weeks to finish, so I divided the points from
> >>> > the shapefile by state (generating a shapefile of points for each
> >>> > state), and used the following function:
> >>> >
> >>> >
> >>> >
> >>> > weather4Point <- function(state)
> >>> > {
> >>> >   # get the point file for the state
> >>> >   setwd("c:/PRISM/US1Points/")
> >>> >   pl <- readOGR(".", paste("points4_", state, sep = ''))
> >>> >   setwd("e:/PRISM/NetCDF/")
> >>> >
> >>> >   for (year in 1981:2012)
> >>> >   {
> >>> >     # get the three netCDF files for this year
> >>> >     tminfile <- paste("tmin", "_", year, ".nc", sep = '')
> >>> >     b_tmin <- brick(tminfile, varname = 'tmin')
> >>> >     pptfile <- paste("ppt", "_", year, ".nc", sep = '')
> >>> >     b_ppt <- brick(pptfile, varname = 'ppt')
> >>> >     tmaxfile <- paste("tmax", "_", year, ".nc", sep = '')
> >>> >     b_tmax <- brick(tmaxfile, varname = 'tmax')
> >>> >     print(paste("processing year :", year, sep = ''))
> >>> >
> >>> >     for (l in 1:length(pl))
> >>> >     {
> >>> >       # file name convention N(latitude)W(longitude).wth,
> >>> >       # 5 digits after the decimal point
> >>> >       filename <- paste("e:/PRISM/US3/N",
> >>> >         round(coordinates(pl[l, ])[2], 5), "W",
> >>> >         abs(round(coordinates(pl[l, ])[1], 5)), ".wth", sep = '')
> >>> >       print(paste("processing file :", filename, sep = ''))
> >>> >
> >>> >       tmin <- as.numeric(round(extract(b_tmin, coordinates(pl[l, ])), digits = 1))
> >>> >       tmax <- as.numeric(round(extract(b_tmax, coordinates(pl[l, ])), digits = 1))
> >>> >       ppt  <- as.numeric(round(extract(b_ppt, coordinates(pl[l, ])), digits = 2))
> >>> >
> >>> >       v <- data.frame(cbind(tmax, tmin, ppt))
> >>> >       colnames(v) <- c("tmax", "tmin", "ppt")  # names must match the cbind order
> >>> >       v["default"] <- 0
> >>> >       v["year"] <- year
> >>> >       date <- seq(as.Date(paste(year, "/1/1", sep = '')),
> >>> >         as.Date(paste(year, "/12/31", sep = '')), "days")
> >>> >       v["month"] <- as.numeric(substr(date, 6, 7))
> >>> >       v["day"]   <- as.numeric(substr(date, 9, 10))
> >>> >       v <- v[c("year", "month", "day", "default", "tmin", "tmax", "ppt")]
> >>> >
> >>> >       # write into a file with the APEX format (append if it exists)
> >>> >       if (file.exists(filename)) {
> >>> >         write.fwf(x = v, filename, append = TRUE, na = "NA",
> >>> >           rownames = FALSE, colnames = FALSE, width = c(6, 3, 3, 5, 5, 5, 6))
> >>> >       } else {
> >>> >         write.fwf(x = v, filename, append = FALSE, na = "NA",
> >>> >           rownames = FALSE, colnames = FALSE, width = c(6, 3, 3, 5, 5, 5, 6))
> >>> >       }
> >>> >     }
> >>> >   }
> >>> > }
> >>> >
> >>> >
> >>> >
> >>> > Then I ran the following code, opening several R sessions, to make
> >>> > this process "parallel":
> >>> >
> >>> >
> >>> >
> >>> > States <- c("AL","AZ","AR","CO","CT","DC","FL","GA","ID","IA",
> >>> >             "IN","KS","KY","LA","MA","ME","MI","MN","MT","NC")
> >>> >
> >>> > for (state in States)
> >>> > {
> >>> >   weather4Point(state)
> >>> > }
> >>> >
> >>> >
> >>> >
> >>> > However, this still feels slow when I plan to do the whole US. I am
> >>> > asking whether there is a better solution to speed up this process
> >>> > (any possibility of employing parallel programming in R)?
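> >>> >
> >>> > For example, would something along these lines work (a rough sketch;
> >>> > weather4Point is the function above, and the worker count is a guess)?
> >>> >
> >>> >   library(parallel)
> >>> >
> >>> >   cl <- makeCluster(4)   # one R worker per batch of states
> >>> >   # load the packages the workers need, and ship them the function
> >>> >   clusterEvalQ(cl, { library(rgdal); library(raster); library(gdata) })
> >>> >   clusterExport(cl, "weather4Point")
> >>> >   parLapply(cl, States, weather4Point)
> >>> >   stopCluster(cl)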
> >>> >
> >>> >
> >>> >
> >>> > Environment information:
> >>> >
> >>> > R version 3.0.1 (2013-05-16)
> >>> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> >>> > Memory: 16 GB
> >>> > Processor: Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz
> >>> >
> >>> >
> >>> >
> >>> > Looking forward to any suggestions.
> >>> >
> >>> >
> >>> >
> >>> > Regards,
> >>> >
> >>> >
> >>> >
> >>> > Ping
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >
> >
> > --
> > Julian Mariano Burgos, PhD
> > Hafrannsóknastofnun/Marine Research Institute
> > Skúlagata 4, 121 Reykjavík, Iceland
> > Sími/Telephone : +354-5752037
> > Bréfsími/Telefax:  +354-5752001
> > Netfang/Email: [email protected]
> >
>
>
>
> --
> Jonathan A. Greenberg, PhD
> Assistant Professor
> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
> Department of Geography and Geographic Information Science
> University of Illinois at Urbana-Champaign
> 259 Computing Applications Building, MC-150
> 605 East Springfield Avenue
> Champaign, IL  61820-6371
> Phone: 217-300-1924
> http://www.geog.illinois.edu/~jgrn/
> AIM: jgrn307, MSN: [email protected], Gchat: jgrn307, Skype: jgrn3007
>
