Hi MADlib Developers,

Following Ivan and Frank's suggestion, I would like to propose a description and interface for geographically weighted regression (GWR). PostGIS functions will be invoked to compute distances in a given CRS and to extract the rectangle coordinates of the study area. If MADlib does not have access to PostGIS routines, we can only implement a few simple GIS utilities in our own code.

GWR models the local relationship between a numerical dependent variable and one or more explanatory independent variables, building a model of spatially varying relationships. It has been widely used to understand the spatial patterns of natural and social phenomena. GWR fits a separate local equation for each location in the table, using the dependent and independent variables that fall within the bandwidth of each target geometry. The shape and extent of the bandwidth depend on the spatial kernel type (gauss, exp or bisquare) and on the distance for fixed kernels (or the number of neighbors for adaptive kernels). Because one equation is fitted per location, the computational burden of GWR increases with the number of prediction locations, so a parallelized GWR is necessary in a high-performance environment such as GPDB.

There are two important caveats about GWR. First, GWR can estimate coefficients at any location but can only provide diagnostic information at observation locations. Second, according to Páez et al. (2011), basic GWR is not an appropriate method for small sample sizes (<160). Many more advanced geographically weighted methods have been proposed (see Wheeler DC 2009, Brunsdon C et al. 2012, Gollini I et al. 2015), and I plan to implement them in the future.

A description of the interface and functions for GWR is provided below. The coefficient columns in the output are kept separate so that the results are easy to map in GIS.

Could you kindly take a look and give me advice or feedback to improve it?

Many thanks!

Best,
ChenLiang Wang

------------------------------------------------------------------------
Description of Geographically Weighted Regression
(Spatial Statistics->Regression Models)

Training Function

The geographically weighted regression training function has the following syntax.

gwregr_train(source_table,
             out_table,
             dependent_varname,
             independent_varname,
             kernel_params,
             adaptive_option,
             ftest_option,
             regression_location,
             prediction_location,
             grouping_cols,
             verbose
            )
------------------------------------------------------------------------
Arguments

source_table
    TEXT. The name of the table containing the training data.

out_table
    TEXT. Name of the generated table containing the output model.
    The output table contains the following columns.

    <...>
        Any grouping columns provided during training. Present only if the grouping option is used.
    coef_<independent_varname1>, coef_<independent_varname2> ...
        FLOAT8[]. One column per independent variable, holding the vector of regression coefficients for that variable at each location.
    r2
        FLOAT8. R-squared coefficient of determination of the model.
    adjr2
        FLOAT8. Adjusted R-squared coefficient of determination of the model.
    local_cond_no
        FLOAT8[]. The local condition number of GWR at each location (see Wheeler D 2007). Values above 30 indicate that results are unstable due to local multicollinearity.
    F1_stats
        FLOAT8[]. The F-test array {F-statistic, Numerator DF, Denominator DF, p_value} for comparing Ordinary Linear Regression (OLR) and GWR models (see Leung et al. 2000).
    F2_stats
        FLOAT8[]. The F-test array {F-statistic, Numerator DF, Denominator DF, p_value} for comparing Ordinary Linear Regression (OLR) and GWR models (see Leung et al. 2000).
    F3_stats
        FLOAT8[]. The spatial stationarity test statistic for the GWR coefficients (see Leung et al. 2000).
    F3_ndf
        FLOAT8[]. The spatial stationarity test Numerator DF for the GWR coefficients (see Leung et al. 2000).
    F3_ddf
        FLOAT8[]. The spatial stationarity test Denominator DF for the GWR coefficients (see Leung et al. 2000).
    F3_pv
        FLOAT8[]. The spatial stationarity test p_value for the GWR coefficients (see Leung et al. 2000).
    F4_stats
        FLOAT8[]. The F-test array {F-statistic, Numerator DF, Denominator DF, p_value} for comparing Ordinary Linear Regression (OLR) and GWR models (see GWR book p92).
    num_missing_rows_skipped
        INTEGER. The number of rows that have NULL values in the dependent and independent variables, and were skipped in the computation for each group.

    A summary table named <out_table>_summary is created together with the output table.
    It has the following columns:

    source_table
        The data source table name.
    out_table
        The output table name.
    dependent_varname
        The dependent variable.
    independent_varname
        The independent variables.
    num_rows_processed
        The total number of rows that were used in the computation.
    num_missing_rows_skipped
        The total number of rows that were skipped because of NULL values in them.
    kernel_function
        The spatial kernel function.
    bandwidth
        The bandwidth parameter.
    adaptive_option
        Boolean indicating whether an adaptive kernel function was used.

dependent_varname
    TEXT. Expression to evaluate for the dependent variable.

independent_varname
    TEXT. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list.

kernel_params (optional)
    TEXT, default: 'kernel=gauss,bw=CV'. Parameters for the kernel function. The kernel parameter is the name of the kernel function to use:

        'gauss':    wgt = exp(-.5*(vdist/bw)^2)
        'exp':      wgt = exp(-vdist/bw)
        'bisquare': wgt = (1-(vdist/bw)^2)^2 if vdist < bw, wgt = 0 otherwise

    where wgt is the weight, vdist is the vector of distances, and bw is the bandwidth. Set bw to CV (cross-validation) or AICc (corrected Akaike information criterion) to have the bandwidth selected automatically when you are not sure what distance or number of neighbors to use. A numerical value can also be specified for bw. If bw is large enough (above 1e7, for example), the coefficients estimated by GWR are effectively equal to the global estimates from ordinary linear regression.

adaptive_option (optional)
    BOOLEAN, default: FALSE. When TRUE, an adaptive kernel is calculated, where the bandwidth corresponds to the number of nearest neighbours (i.e. adaptive distance).

ftest_option (optional)
    BOOLEAN, default: FALSE. When TRUE, three F-tests and a spatial stationarity test of the coefficients are also conducted and returned with the results, according to Leung et al. (2000).

regression_location
    2D Point or Polygon Geometry. A geometry (usually 2D point geometry) representing the locations where training should be conducted. The number of geometries in regression_location must equal the number of rows in source_table. In most cases, it is a geometry field of source_table.

prediction_location (optional)
    2D Point or Polygon Geometry, default: regression_location. A geometry (usually 2D point geometry) representing the locations where the coefficients should be estimated.

grouping_cols (optional)
    TEXT, default: NULL. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single result model is generated.

verbose (optional)
    BOOLEAN, default: FALSE. Provides verbose output of the results of training.

------------------------------------------------------------------------
Prediction Function

gwregr_predict(coef, col_ind, newdata_table)

Arguments

coef
    FLOAT8[][]. Vector of the coefficients of regression.

col_ind
    FLOAT8[]. An array containing the independent variable column names.

newdata_table (optional)
    TEXT, default: NULL. The name of the table that provides new data at the prediction locations. If prediction_location is the same as regression_location (the default) in the training function, this parameter can be omitted. Otherwise, newdata_table is required and must provide the independent variables at the prediction locations, with field names identical to those in source_table.

> Date: Fri, 18 Dec 2015 09:18:22 -0800
> Subject: Re: How to contribute a spatial module to MADlib manipulating
> objects from PostGIS
> From: fmcquil...@pivotal.io
> To: dev@madlib.incubator.apache.org
>
> Thanks ChenLiang Wang for your interest.
>
> I would repeat Ivan's welcome to you, and I look forward to your
> contributions in the area of GIS.
>
> To answer your questions:
>
> 1. Yes, it is possible to call PostGIS functions from MADlib.
>
> 2. Yes, spatial statistics are suitable for MADlib.
>
> For documentation, please refer to the Apache MADlib wiki
> http://madlib.incubator.apache.org/
>
> which includes:
> Quick Start Guides
>
> Get going with a minimum of fuss.
>
> - Installation Guide
>   <https://cwiki.apache.org/confluence/display/MADLIB/Installation+Guide>
> - Quick Start Guide for Users
>   <https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Users>
> - Quick Start Guide for Developers
>   <https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers>
>
> As Ivan mentioned, writing down the functions you would like to build and
> the interface is a good place to begin. Then we can discuss on the open
> mailing list.
>
> Regards,
> Frank
>
> On Thu, Dec 17, 2015 at 8:11 PM, 王晨 亮 <hi181904...@msn.com> wrote:
>
> > Thanks for your quick reply. Your suggestion is great. I will give
> > definitions and descriptions for the spatial statistic functions and a
> > comparison with ordinary statistic models.
> >
> > > Date: Thu, 17 Dec 2015 21:56:06 -0500
> > > Subject: Re: How to contribute a spatial module to MADlib manipulating
> > > objects from PostGIS
> > > From: inov...@pivotal.io
> > > To: dev@madlib.incubator.apache.org
> > >
> > > Hi ChenLiang,
> > >
> > > I think your proposal is good and worth trying!
> > >
> > > Can I suggest, as a first step, that you send a proposal of the function
> > > definitions, the parameters and return values, as well as a description
> > > of the functions and what they do.
> > >
> > > Based on that we can discuss the design of the interface, and once it
> > > looks good you can start working on the actual implementation.
> > > When you get to implementation we can help you with technical challenges.
> > >
> > > Cheers,
> > > Ivan
> > >
> > > On Thu, Dec 17, 2015 at 9:50 PM, 王晨 亮 <hi181904...@msn.com> wrote:
> > >
> > > > Hi MADlib Developers,
> > > >
> > > > I am a GIS researcher and have some knowledge of PostGIS, Python,
> > > > C/C++, Java and R.
> > > >
> > > > I learned some spatial statistical models during my PhD research in
> > > > GIS. Recently, I translated GWR (Geographically Weighted Regression)
> > > > from R into Java for my company, and I would like to contribute to
> > > > MADlib if possible. I believe PostGIS and MADlib are the most powerful
> > > > extensions of PostgreSQL. Therefore, a spatial statistical module
> > > > connecting the two libraries could be significant. If I can start the
> > > > task, the first goal to implement will be the GWR model.
> > > >
> > > > Now I am reading the developer guide of MADlib. I am not quite sure how
> > > > to contribute a geospatial module to MADlib. Is it possible to
> > > > manipulate spatial objects or attributes from PostGIS in MADlib?
> > > >
> > > > So could anyone suggest a few pointers & links that I can follow to get
> > > > to know:
> > > >
> > > > 1. how to deal with these dependencies in MADlib?
> > > >
> > > > 2. whether the spatial statistics module is suitable for MADlib?
> > > >
> > > > Thank you in advance.
> > > >
> > > > ChenLiang Wang
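
For illustration, the three kernels listed under kernel_params in the proposal, and the local fit they feed into, can be sketched in plain Python. This is only a rough sketch under stated assumptions, not the proposed implementation: the names kernel_weight, solve and local_coefficients are hypothetical and not part of the gwregr_train interface, plain Euclidean distance on planar coordinates stands in for a PostGIS distance, and only a fixed (non-adaptive) bandwidth is handled. The local fit solves the weighted normal equations (X'WX) beta = X'Wy at a single target location, which is the usual GWR estimator.

```python
import math


def kernel_weight(vdist, bw, kernel="gauss"):
    """Weight of one observation at distance vdist from the target location."""
    if kernel == "gauss":
        return math.exp(-0.5 * (vdist / bw) ** 2)
    if kernel == "exp":
        return math.exp(-vdist / bw)
    if kernel == "bisquare":
        return (1.0 - (vdist / bw) ** 2) ** 2 if vdist < bw else 0.0
    raise ValueError("unknown kernel: " + kernel)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular local design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def local_coefficients(target, points, X, y, bw, kernel="gauss"):
    """Weighted least squares at one target location.

    points: (x, y) coordinates of the observations (assumed planar).
    X: rows of independent variables, including an explicit constant 1.
    """
    w = [kernel_weight(math.dist(target, p), bw, kernel) for p in points]
    k = len(X[0])
    xtwx = [[sum(wi * xi[r] * xi[c] for wi, xi in zip(w, X)) for c in range(k)]
            for r in range(k)]
    xtwy = [sum(wi * xi[r] * yi for wi, xi, yi in zip(w, X, y)) for r in range(k)]
    return solve(xtwx, xtwy)


# Toy data: y = 1 + 2*x exactly, with an explicit intercept column.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

beta = local_coefficients((0.5, 0.5), points, X, y, bw=1.5)
beta_global = local_coefficients((0.5, 0.5), points, X, y, bw=1e7)
print(beta, beta_global)
```

Because the toy data follow one exact linear relationship everywhere, the local and the large-bandwidth fits recover the same intercept and slope; with real data the coefficients would vary by location. A prediction at that location is then the dot product of the local coefficients with the independent variables there.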