Dear Roger, Thank you again for your answer. You're really helping me a lot.
I was wondering how to deal with the scales for lme4::lmer, e.g. I have the variable income_per_capita which has the income_per_capita for the zipcode in which the listing/ad is located. Is it possible to add an income_per_capita_zipcode and an income_per_capita_borough for the different scales/levels of the multilevel model? lme4::lmer(log_price ~ factor(room_type) + bedrooms + bathrooms + guests_included + minimum_nights + distance_downtown + distance_subway + number_of_reviews + review_scores_cleanliness + professional_host + host_is_superhost + is_business_travel_ready + offense_misdemeanor + offense_felony + income_per_capita + factor(date_compiled) + (1 | borough / zipcode / id), data = listings) Thank you and best regards ________________________________________ From: Roger Bivand <roger.biv...@nhh.no> Sent: Monday, December 9, 2019 12:14 To: Robert R Cc: r-sig-geo@r-project.org Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method On Sun, 8 Dec 2019, Robert R wrote: > Dear Roger, > > Thank you for your answer. Regarding the sparse matrix, you're right - I > tested creating one, as follows: > > ##### > > listings_nb <- listings %>% spdep::poly2nb(queen = TRUE) %>% > spdep::include.self() # is.symmetric.nb(listings_nb) Are listings point or polygon support (must be polygon for this function)? > > nb_B <- spdep::nb2listw(neighbours = listings_nb, style="B", zero.policy > = FALSE) > > B <- as(nb_B , "CsparseMatrix") > all(B == t(B)) > > nb_B1 <- mat2listw(as(B, "dgTMatrix")) > > format(object.size(nb_B), units = "Mb") > # 85.6 Mb > > format(object.size(nb_B1), units = "Mb") > # 85.6 Mb > > ##### > > The size for both objects is the same - different from this example: > https://cran.r-project.org/web/packages/spdep/vignettes/nb_igraph.html > > > Summing up the data that I have: a "picture" for Airbnb's listings/ads > once a month for NYC, incl. lat, lon, price (per night), id, and some > characteristics from the listing/ad such as number of rooms, bedrooms, guests > included, etc. for a period of 54 months. The data was taken here: > http://insideairbnb.com/get-the-data.html (listings.csv.gz) > > For the OLS, I used a pooled OLS with time dummy fixed effects > (date_compiled, when the "picture" was compiled - how Airbnb listings > for NYC were shown) because many of my observations (listings id) do not > repeat for many periods. Also, many listings changed the room_type at > least once during the whole time period analyzed (3 types of room_type: > entire home/apt, private room, shared room). > OK, I suppose. > I am now trying a random intercepts three-level hierarchy multilevel > model, where _id_ (level 1) is nested within _zipcode_ (level 2), and > the last is nested within _borough_ (level 3). So groups: > id:(zipcode:borough). > I'd try and see what happens, and watch machine resources (memory anyway). > lme4::lmer(log_price ~ factor(room_type) + bedrooms + bathrooms + > guests_included + minimum_nights + distance_downtown + distance_subway + > number_of_reviews + review_scores_cleanliness + professional_host + > host_is_superhost + is_business_travel_ready + offense_misdemeanor + > offense_felony + income_per_capita + factor(date_compiled) + (1 | > borough / zipcode / id), data = listings) > > Roger, do you think it is okay to use factor(room_type) (for the 3 types of > room) and factor(date_compiled) for the dates when the NYC Airbnb's > listings/ads were extracted? Do you see structured variability coming from these - if so, inclusion makes sense. 
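One way to set up the zipcode- and borough-scale income variables asked about above is to keep the zipcode-level value as its own column and add a borough-level aggregate alongside it; a minimal sketch, not from the thread - the new column names and the simple mean over the listings in each borough are assumptions:

#####

library(dplyr)
library(lme4)

# income_per_capita is already measured at the zipcode level, so the
# zipcode-scale covariate is just a copy; the borough-scale version here is
# the mean of the zipcode values over the listings in each borough
# (a population-weighted mean may be preferable)
listings <- listings %>%
  mutate(income_per_capita_zipcode = income_per_capita) %>%
  group_by(borough) %>%
  mutate(income_per_capita_borough = mean(income_per_capita, na.rm = TRUE)) %>%
  ungroup()

# both scale-specific covariates then enter the fixed part of the model
fit <- lmer(log_price ~ income_per_capita_zipcode + income_per_capita_borough +
              factor(room_type) + factor(date_compiled) +
              (1 | borough / zipcode / id),
            data = listings)

#####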
Roger > > Thank you and best regards, > Robert > > ________________________________________ > From: Roger Bivand <roger.biv...@nhh.no> > Sent: Wednesday, December 4, 2019 09:07 > To: Robert R > Cc: r-sig-geo@r-project.org > Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method > > On Wed, 4 Dec 2019, Robert R wrote: > >> Dear Roger, >> >> Again, thank you for your answer. What do you mean by "zip code random >> effect"? You mean I should use in plm the model "random"? >> >> regression_re <- plm(formula = model, data = listings, model = "random", >> index = c("id", "date_compiled")) > > No, obviously not, your data are not a balanced panel. I mean a multilevel > model, where the <200 zip codes cluster the data, and where a zip code > level IID RE will almost certainly do a better job than dummies. An > MRF/ICAR RE might be an extension. > >> >> And is there any other methodology for dealing with large weight matrices in >> spatialreg::lagsarlm? > > Please refer to Bivand et al. (2013) referred to in the package. Probably > the weights would need to be symmetric and very sparse. > > I still think that you should focus on a small subset of the data and on > improving the signal-noise ratio before trying to scale up. > > Roger > >> >> Thank you and best regards, >> Robert >> >> ________________________________________ >> From: Roger Bivand <roger.biv...@nhh.no> >> Sent: Wednesday, November 27, 2019 13:53 >> To: Robert R >> Cc: r-sig-geo@r-project.org >> Subject: SV: [R-sig-Geo] Spatial Autocorrelation Estimation Method >> >> Yes, this is expected, since the neighbours within a single zip code block form a >> dense matrix, and there will be multiple such matrices. (15000^2)*8 is >> 1.8e+09, so such a dense matrix will max out your RAM. There is no way to >> look at block neighbours in that format without subsetting your data (think >> train/test); use a zip code random effect instead. I would certainly drop all >> attempts to examine spatial dependency until you get an aspatial multilevel >> hedonic model working. >> >> Roger >> >> -- >> Roger Bivand >> Norwegian School of Economics >> Helleveien 30, 5045 Bergen, Norway >> roger.biv...@nhh.no >> >> >> ________________________________________ >> From: Robert R <userca...@outlook.com> >> Sent: Tuesday, November 26, 2019 21:04 >> To: Roger Bivand >> Cc: r-sig-geo@r-project.org >> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >> >> Dear Roger, >> >> Thank you for your e-mail. Actually there is less noise than it seems. >> Rental prices are daily rental prices and I have an extract of all Airbnb >> listings' daily prices once a month for a period of 4 years. Each listing's >> information contains the lat, lon, number of bedrooms, category (entire >> home/apt, shared room or private room), etc. >> >> One question regarding the spdep::nb2blocknb function: it runs super fast >> with up to n = 1000, and always crashes my R session with n = 15000 or so. >> Is there an alternative to solve this problem? >> >> Thank you and best regards, >> Robert >> >> ________________________________________ >> From: Roger Bivand <roger.biv...@nhh.no> >> Sent: Tuesday, November 26, 2019 20:48 >> To: Robert R >> Cc: r-sig-geo@r-project.org >> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >> >> Sorry for late reply, am indisposed and unable to help further. 
I feel >> that there is so much noise in your data (differences in offers, rental >> lengths, repeats or not, etc.), that you will certainly have to subset >> vigorously first to isolate response cases that are comparable. What you >> are trying to disentangle are the hedonic components in the bundle where >> you just have price as response, but lots of other bundle characteristics >> on the right hand side (days, etc.). I feel you'd need to try to get to a >> response index of price per day per rental area or some such. I'd >> certainly advise examining responses to a specific driver (major concert >> or sports event) to get a feel for how the market responds, and return to >> spatial hedonic after finding an approach that gives reasonable aspatial >> outcomes. >> >> Roger >> >> On Sun, 17 Nov 2019, Robert R wrote: >> >>> Dear Roger, >>> >>> Thank you for your message and sorry for my late answer. >>> >>> Regarding the number of listings (lettings) for my data set (2.216.642 >>> observations), each listing contains an individual id: >>> >>> unique ids: 180.004 >>> time periods: 54 (2015-01 to 2019-09) >>> number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%) >>> number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%) >>> number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) >>> (25,8%) >>> number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) >>> (12,5%) >>> >>> Important to note is that hosts can change the room_category (between >>> entire home/apt, private room and shared room) while keeping the same listing id >>> number. In my data, the number of unique ids that at some point changed the >>> room_type is 7.204. >>> >>> -- >>> >>> For the OLS model, I was using only a fixed effect model, where each time >>> period (date_compiled) (54 in total) is a time dummy. >>> >>> plm::plm(formula = model, data = listings, model = "pooling", index = >>> c("id", "date_compiled")) >>> >>> >>> -- >>> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial >>> fixed effects (SFE) hedonic model, where each defined neighborhood zone in >>> the study area is represented by dummy variables. >>> >>> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model >>> specifications to accommodate geographically hierarchical data structures: >>> (1) groupwise W and fixed regional effects; (2) groupwise W and random >>> regional effects; (3) proximity-based W and fixed regional effects; and (4) >>> proximity-based W and random regional effects. >>> -- >>> >>> I created a new column/variable containing the borough where the zipcode is >>> found (Manhattan, Brooklyn, Queens, Bronx, Staten Island). >>> >>> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous >>> Autoregressive Model (HSAR) considers the occurrence of spatial relations >>> at the (lower) individual (geographical coordinates - in my case, the >>> listing location) and (higher) group level (territorial units - in my case, >>> zipcodes). >>> >>> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. >>> The HSAR model may also be estimated without this component." So, in this >>> case I only estimate the Hierarchical Spatial Simultaneous Autoregressive >>> Model (HSAR) on a "one-level" basis, i.e., at the higher level. 
>>> >>> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = >>> 5000, Nsim = 10000, thinning = 1, parameters.start = pars) >>> >>> (Where the "model" formula contains the 54 time dummy variables) >>> >>> Do you think I can proceed with this model? I was able to calculate it. >>> >>> If I remove all observations/rows with NAs in one of the chosen >>> variables/columns, 884.183 observations remain. If I were to create a W >>> matrix for HSAR::hsar, I would have a gigantic 884.183 by 884.183 matrix. >>> This is the reason why I put W = NULL. >>> >>> >>> Thank you and best regards >>> >>> ________________________________________ >>> From: Roger Bivand <roger.biv...@nhh.no> >>> Sent: Monday, November 11, 2019 11:31 >>> To: Robert R >>> Cc: r-sig-geo@r-project.org >>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>> >>> On Sun, 10 Nov 2019, Robert R wrote: >>> >>>> Dear Roger, >>>> >>>> Again, thank you for your answer. I read the material provided and >>>> decided that the Hierarchical Spatial Autoregressive (HSAR) model could be the >>>> right model for me. >>>> >>>> I indeed have the precise latitude and longitude information for all my >>>> listings for NYC. >>>> >>>> I created a stratified sample (group = zipcode) with 22172 (1%) of my >>>> observations called listings_sample and tried to replicate the hsar >>>> model, please see below. >>>> >>>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix. >>> >>> Unless you know definitely that you want to relate the response to its >>> lagged value, you do not need this. Do note that the matrix is very >>> sparse, so it could be fitted without difficulty with ML in a cross-sectional >>> model. >>> >>>> >>>> You recommended then to introduce a Markov random field (MRF) random >>>> effect (RE) at the zipcode level, but I did not understand it so well. >>>> Could you elaborate a little more? >>>> >>> >>> Did you read the development in >>> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and >>> the paper includes code for fitting the Beijing housing parcels data set from HSAR >>> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should >>> try to create a model that works on a single borough, using the zipcodes >>> in that borough as a proxy for unobserved neighbourhood effects. Try for >>> example using lme4::lmer() with only a zipcode IID random effect, see if >>> the hedonic estimates are similar to lm(), and leave adding an MRF RE >>> (with for example mgcv::gam() or hglm::hglm()) until you have a working >>> testbed. Then advance step-by-step from there. >>> >>> You still have not said how many repeat lettings you see - it will affect >>> the way you specify your model. 
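The repeat-letting pattern asked about just above can be tabulated directly from the panel; a minimal sketch (illustrative only), assuming listings carries id and date_compiled columns:

#####

library(dplyr)

# number of monthly snapshots in which each listing id appears
id_repeats <- listings %>%
  distinct(id, date_compiled) %>%
  count(id, name = "n_periods")

# bucket the ids by how often they repeat (1, 2-10, 11-30, 31-54 periods)
id_repeats %>%
  mutate(bucket = cut(n_periods, breaks = c(0, 1, 10, 30, 54))) %>%
  count(bucket) %>%
  mutate(share = n / sum(n))

#####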
>>> >>> Roger >>> >>>> ############## >>>> library(spdep) >>>> library(HSAR) >>>> library(dplyr) >>>> library(splitstackshape) >>>> >>>> >>>> # Stratified sample per zipcode (size = 1%) listings_sample <- >>>> splitstackshape::stratified(indt = listings, group = "zipcode", size = >>>> 0.01) >>>> >>>> # Removing zipcodes from polygon_nyc which are not observable in >>>> listings_sample polygon_nyc_listings <- polygon_nyc %>% filter(zipcode >>>> %in% c(unique(as.character(listings_sample$zipcode)))) >>>> >>>> >>>> ## Random effect matrix (N by J) >>>> >>>> # N: 22172 >>>> # J: 154 >>>> >>>> # Arrange listings_sample by zipcode (ascending) >>>> listings_sample <- listings_sample %>% arrange(zipcode) >>>> >>>> # Count number of listings per zipcode >>>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% >>>> summarise(count = n()) %>% as.data.frame() >>>> # sum(MM$count) >>>> >>>> # N by J zero matrix creation >>>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1]) >>>> >>>> # Group (zipcode) index for each observation >>>> Uid <- rep(c(1:dim(MM)[1]), MM[,2]) >>>> >>>> for(i in 1:dim(MM)[1]) { >>>> Delta[Uid==i,i] <- 1 >>>> } >>>> rm(i) >>>> >>>> Delta <- as(Delta,"dgCMatrix") >>>> >>>> >>>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J) >>>> >>>> # Neighboring polygons: list of neighbors for each polygon (queen >>>> contiguity neighbors) >>>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = >>>> polygon_nyc_listings$zipcode, queen = TRUE) >>>> >>>> # Include neighbour itself as a neighbour >>>> polygon_nyc_nb <- include.self(polygon_nyc_nb) >>>> >>>> # Spatial weights matrix for nb >>>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", >>>> zero.policy = NULL) >>>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix") >>>> >>>> >>>> ## Fit HSAR SAR upper level random effect >>>> model <- as.formula(log_price ~ guests_included + minimum_nights) >>>> >>>> betas = coef(lm(formula = model, data = listings_sample)) >>>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = >>>> betas) >>>> >>>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = >>>> Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars) >>>> >>>> ############## >>>> >>>> Thank you and best regards >>>> Robert >>>> >>>> ________________________________________ >>>> From: Roger Bivand <roger.biv...@nhh.no> >>>> Sent: Friday, November 8, 2019 13:29 >>>> To: Robert R >>>> Cc: r-sig-geo@r-project.org >>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>>> >>>> On Fri, 8 Nov 2019, Robert R wrote: >>>> >>>>> Dear Roger, >>>>> >>>>> Thank you for your answer. >>>>> >>>>> I successfully used the function nb2blocknb() for a smaller dataset. >>>>> >>>>> But for a dataset of over 2 million observations, I get the following >>>>> error: "Error: cannot allocate vector of size 840 Kb". >>>> >>>> I don't think the observations are helpful. If you have repeat lets in the >>>> same property in a given month, you need to handle that anyway. I'd go for >>>> making the modelling exercise work (we agree that this is not panel data, >>>> right?) on a small subset first. I would further argue that you need a >>>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID >>>> RE. You could very well take (stratified) samples per zipcode to represent >>>> your data. Once that works, introduce an MRF RE at the zipcode level, >>>> where you do know relative position. 
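Once the IID testbed works, the MRF RE step suggested just above could be tried with mgcv::gam(), for example; a minimal sketch (illustrative, not the thread's code), assuming zipcode is a factor in listings_sample and that the neighbour list is named so its elements match the factor levels:

#####

library(mgcv)
library(spdep)

# zipcode as a factor, dropping levels not present in the sample
listings_sample$zipcode <- droplevels(factor(listings_sample$zipcode))

# queen contiguity neighbours WITHOUT self-neighbours for the MRF penalty,
# named by zipcode so mgcv can match them to the factor levels
nb_mrf <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode,
                  queen = TRUE)
names(nb_mrf) <- attr(nb_mrf, "region.id")

# structured (MRF) plus unstructured (IID) zipcode random effects
m_gam <- gam(log_price ~ guests_included + minimum_nights +
               s(zipcode, bs = "mrf", xt = list(nb = nb_mrf)) +
               s(zipcode, bs = "re"),
             data = listings_sample, method = "REML")

#####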
Using SARAR is going to be a waste of >>>> time unless you can geocode the letting addresses. A multi-level approach >>>> will work. Having big data in your case with no useful location >>>> information per observation is just adding noise and over-smoothing, I'm >>>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002 >>>> will work, also when you sample the within-zipcode lets, given a split >>>> into training and test sets, and making CV possible. >>>> >>>> Roger >>>> >>>>> >>>>> I am expecting that at least 500.000 observations will be dropped due >>>>> to the lack of values for the chosen variables for the regression model, so >>>>> probably I will filter and remove the observations/rows that will not be >>>>> used anyway - do you know if there is any package that does this >>>>> automatically, given the variables/columns chosen by me? >>>>> >>>>> Or would you recommend another approach to avoid the above-mentioned >>>>> error? >>>>> >>>>> Thank you and best regards, >>>>> Robert >>>>> >>>>> ________________________________________ >>>>> From: Roger Bivand <roger.biv...@nhh.no> >>>>> Sent: Thursday, November 7, 2019 10:13 >>>>> To: Robert R >>>>> Cc: r-sig-geo@r-project.org >>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>>>> >>>>> On Thu, 7 Nov 2019, Robert R wrote: >>>>> >>>>>> Dear Roger, >>>>>> >>>>>> Many thanks for your help. >>>>>> >>>>>> I have an additional question: >>>>>> >>>>>> Is it possible to create a "separate" lw (nb2listw) (with different >>>>>> rownumbers) from my data set? For now, I am taking my data set and >>>>>> merging with the sf object polygon_nyc with the function >>>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create >>>>>> a huge n x n matrix (depending on the size of my data set). >>>>>> >>>>>> Taking the polygon_nyc alone and turning it into a lw (weights list) >>>>>> object gives only n = 177. >>>>>> >>>>>> Of course running >>>>>> >>>>>> spatialreg::lagsarlm(formula=model, data = listings_sample, >>>>>> listw = polygon_nyc_lw, tol.solve=1.0e-10) >>>>>> >>>>>> does not work ("Input data and weights have different dimensions"). >>>>>> >>>>>> Is the only option to take my data set, merge it with my polygon_nyc (by >>>>>> zipcode) and then create the weights list lw? Or is there another option? >>>>> >>>>> I think we are getting more clarity. You do not know the location of the >>>>> lettings beyond their zipcode. You do know the boundaries of the zipcode >>>>> areas, and can create a neighbour object from these boundaries. You then >>>>> want to treat all the lettings in a zipcode area i as neighbours, and >>>>> additionally lettings in zipcode areas neighbouring i as neighbours of >>>>> lettings in i. This is the data structure that motivated the >>>>> spdep::nb2blocknb() function: >>>>> >>>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html >>>>> >>>>> Try running the examples to get a feel for what is going on. >>>>> >>>>> I feel that most of the variability will vanish in the very large numbers >>>>> of neighbours, over-smoothing the outcomes. If you do not have locations >>>>> for the lettings themselves, I don't think you can make much progress. >>>>> >>>>> You could try a linear mixed model (or gam with a spatially structured >>>>> random effect) with a temporal and a spatial random effect. See the HSAR >>>>> package, articles by Dong et al., and maybe >>>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither >>>>> this nor Dong et al. 
handle spatio-temporal settings. MRF spatial random >>>>> effects at the zipcode level might be a way forward, together with an IID >>>>> random effect at the same level (equivalent to self-neighbours). >>>>> >>>>> Hope this helps, >>>>> >>>>> Roger >>>>> >>>>>> >>>>>> Best regards, >>>>>> Robert >>>>>> >>>>>> ________________________________________ >>>>>> From: Roger Bivand <roger.biv...@nhh.no> >>>>>> Sent: Wednesday, November 6, 2019 15:07 >>>>>> To: Robert R >>>>>> Cc: r-sig-geo@r-project.org >>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>>>>> >>>>>> On Tue, 5 Nov 2019, Robert R wrote: >>>>>> >>>>>>> Dear Roger, >>>>>>> >>>>>>> Thank you for your reply. I disabled HTML; my e-mails should now be in >>>>>>> plain text. >>>>>>> >>>>>>> I will give a better context for my desired outcome. >>>>>>> >>>>>>> I am taking Airbnb's listings information for New York City available >>>>>>> on: http://insideairbnb.com/get-the-data.html >>>>>>> >>>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09) >>>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a >>>>>>> Listings/ folder. When importing all these 54 files into one single data >>>>>>> set, I create a new "date_compiled" variable/column. >>>>>>> >>>>>>> In total, after the data cleansing process, I have a little more than 2 >>>>>>> million observations. >>>>>> >>>>>> You have repeat lettings for some, but not all properties. So this is at >>>>>> best a very unbalanced panel. For those properties with repeats, you may >>>>>> see temporal movement (trend/seasonal). >>>>>> >>>>>> I suggest (strongly) taking a single borough or even zipcode with some >>>>>> hundreds of properties, and working from there. Do not include the >>>>>> observation as its own neighbour, perhaps identify repeats and handle >>>>>> them >>>>>> specially (create or use a property ID). Unbalanced panels may also >>>>>> create >>>>>> a selection bias issue (why are some properties only listed sometimes?). >>>>>> >>>>>> So this, although promising, isn't simple, and getting to a hedonic model >>>>>> may be hard, but not (just) because of spatial autocorrelation. I >>>>>> wouldn't >>>>>> necessarily trust OLS output either, partly because of the repeat >>>>>> property >>>>>> issue. >>>>>> >>>>>> Roger >>>>>> >>>>>>> >>>>>>> I created 54 timedummy variables, one for each time period available. >>>>>>> >>>>>>> I want to estimate, using a hedonic spatial timedummy model, the impact of >>>>>>> a variety of characteristics which potentially determine the daily rate >>>>>>> of Airbnb listings through time in New York City (e.g. characteristics >>>>>>> of the listing such as number of bedrooms, whether the host is professional, >>>>>>> proximity to downtown (New York City Hall) and to the nearest subway station >>>>>>> from the listing, income per capita, etc.). >>>>>>> >>>>>>> My dependent variable is price (log price, common in the related >>>>>>> literature for hedonic prices). >>>>>>> >>>>>>> The OLS model is done. >>>>>>> >>>>>>> For the spatial model, I am assuming that hosts, when deciding the >>>>>>> pricing of their listings, take into account not only their structural and >>>>>>> location characteristics, but also the prices charged by nearby listings >>>>>>> with similar characteristics - spatial autocorrelation is then present; >>>>>>> at least, spatial dependence is present in the dependent variable. >>>>>>> >>>>>>> As I wrote in my previous post, I was willing to consider each observation >>>>>>> itself as a neighbor. 
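On the earlier question about dropping rows with missing values in the chosen model variables: no dedicated package is needed - lm() and plm() already omit incomplete rows via na.action, and complete.cases() or tidyr::drop_na() do it explicitly; a minimal sketch with an illustrative set of variable names:

#####

library(dplyr)
library(tidyr)

# illustrative vector of model variables - replace with the columns actually used
model_vars <- c("log_price", "guests_included", "minimum_nights", "zipcode")

# base R: keep only the rows complete on those columns
listings_cc <- listings[complete.cases(listings[, model_vars]), ]

# tidyverse equivalent
listings_cc <- listings %>% drop_na(all_of(model_vars))

#####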
>>>>>>> >>>>>>> Parts of my code can be found below: >>>>>>> >>>>>>> ######## >>>>>>> >>>>>>> ## packages >>>>>>> >>>>>>> packages_install <- function(packages){ >>>>>>> new.packages <- packages[!(packages %in% installed.packages()[, >>>>>>> "Package"])] >>>>>>> if (length(new.packages)) >>>>>>> install.packages(new.packages, dependencies = TRUE) >>>>>>> sapply(packages, require, character.only = TRUE) >>>>>>> } >>>>>>> >>>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", >>>>>>> "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", >>>>>>> "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", >>>>>>> "scales", "sf", "spdep", "stargazer", "tidyverse") >>>>>>> packages_install(packages_required) >>>>>>> >>>>>>> # Working directory >>>>>>> setwd("C:/Users/User/R") >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## shapefile_us >>>>>>> >>>>>>> # Shapefile zips import and Coordinate Reference System (CRS) >>>>>>> transformation >>>>>>> # Shapefile download: >>>>>>> https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip >>>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = >>>>>>> "cb_2018_us_zcta510_500k") >>>>>>> >>>>>>> # Columns removal >>>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, >>>>>>> ALAND10, AWATER10)) >>>>>>> >>>>>>> # Column rename: ZCTA5CE10 >>>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode")) >>>>>>> >>>>>>> # Column class change: zipcode >>>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## polygon_nyc >>>>>>> >>>>>>> # Zip code not available in shapefile: 11695 >>>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## weight_matrix >>>>>>> >>>>>>> # Neighboring polygons: list of neighbors for each polygon (queen >>>>>>> contiguity neighbors) >>>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), >>>>>>> queen=TRUE) >>>>>>> >>>>>>> # Include neighbour itself as a neighbour >>>>>>> # for(i in >>>>>>> 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))} >>>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb) >>>>>>> >>>>>>> # Weights to each neighboring polygon >>>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## listings >>>>>>> >>>>>>> # Data import >>>>>>> files <- list.files(path="Listings/", pattern=".csv", full.names=TRUE) >>>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, >>>>>>> stringsAsFactors = FALSE, encoding="UTF-8")), files) >>>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings)) >>>>>>> listings <- listings %>% bind_rows >>>>>>> >>>>>>> # Characters removal >>>>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled) >>>>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled) >>>>>>> listings$price <- gsub("\\$", "", listings$price) >>>>>>> listings$price <- gsub(",", "", listings$price) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## timedummy >>>>>>> >>>>>>> timedummy <- sapply("date_compiled_", paste, >>>>>>> unique(listings$date_compiled), sep="") >>>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ") >>>>>>> timedummy <- gsub("-", "_", timedummy) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## OLS regression >>>>>>> >>>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb >>>>>>> listings price at different points in time >>>>>>> regression <- 
plm(formula=as.formula(paste("log_price ~ #some >>>>>>> variables", timedummy, sep = "", collapse = " + ")), data=listings, >>>>>>> model="pooling", index="id") >>>>>>> >>>>>>> ######## >>>>>>> >>>>>>> Some of my ids repeat in multiple time periods. >>>>>>> >>>>>>> I use NYC's zip codes to left join my data with the neighborhood zip >>>>>>> code specific characteristics, such as income per capita for that >>>>>>> specific zip code, etc. >>>>>>> >>>>>>> Now I want to apply the hedonic model with the timedummy variables. >>>>>>> >>>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) >>>>>>> Do I have to join the polygon_nyc (by zip code) to my listings data >>>>>>> set, and then calculate the weight matrix "lw"? >>>>>>> >>>>>>> Again, thank you very much for the help provided until now. >>>>>>> >>>>>>> Best regards, >>>>>>> Robert >>>>>>> >>>>>>> ________________________________________ >>>>>>> From: Roger Bivand <roger.biv...@nhh.no> >>>>>>> Sent: Tuesday, November 5, 2019 15:30 >>>>>>> To: Robert R >>>>>>> Cc: r-sig-geo@r-project.org >>>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>>>>>> >>>>>>> On Tue, 5 Nov 2019, Robert R wrote: >>>>>>> >>>>>>>> I have a large pooled cross-section data set. I would like to >>>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming >>>>>>>> for now that spatial dependence is present in both the dependent >>>>>>>> variable and the error term. My data set is over a period of 4 years, >>>>>>>> monthly data (54 periods). For this purpose, I've created a time dummy >>>>>>>> variable for each time period. I also created a weight matrix using the >>>>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way >>>>>>>> to estimate my model which contains a really big data set. Basically, my >>>>>>>> model is as follows: y = τD + ρW1y + Xβ + λW2u + ε. My questions are: 1) >>>>>>>> My spatial weight matrix for the whole data set will probably be an >>>>>>>> enormous matrix with submatrices for each time period itself. I don't >>>>>>>> think it would be possible to calculate this. What I would like to know >>>>>>>> is a way to estimate each time dummy/period separately (to compare >>>>>>>> different periods alone). How to do it? 2) Which package to use: spdep >>>>>>>> or splm? Thank you and best regards, Robert >>>>>>> >>>>>>> Please do not post HTML, only plain text. Almost certainly your model >>>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are >>>>>>> untried). What is your cross-sectional size? Using sparse kronecker >>>>>>> products, the "enormous" matrix may not be very big. Does it make any >>>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most >>>>>>> of the covariates time-varying? Please provide motivation and use area >>>>>>> (preferably with affiliation; your email and user name are not >>>>>>> informative) - this feels like a real estate problem, probably wrongly >>>>>>> specified. You should use splm if time makes sense in your case, but if it >>>>>>> really doesn't, simplify your approach, as much of the data will be >>>>>>> subject to very large temporal autocorrelation. >>>>>>> >>>>>>> If this is a continuation of your previous question about using >>>>>>> self-neighbours, be aware that you should not use self-neighbours in >>>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure. 
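A minimal sketch of the sparse Kronecker product idea mentioned above (illustrative only; whether a panel-wide W is needed at all is the separate question raised in the reply), assuming polygon_nyc_nb is the cross-sectional neighbour object and 54 monthly periods:

#####

library(Matrix)
library(spdep)

# cross-sectional weights as a sparse matrix (the coercion used earlier in the thread)
W <- as(nb2listw(polygon_nyc_nb, style = "W", zero.policy = TRUE), "CsparseMatrix")

# block-diagonal panel weights: identity over the T periods kronecker the N x N W;
# the (T*N) x (T*N) result stays sparse, so it need not be "enormous" in memory
T_periods <- 54
W_panel <- Diagonal(T_periods) %x% W
format(object.size(W_panel), units = "Mb")

#####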
>>>>>>> Roger -- Roger Bivand Department of Economics, Norwegian School of Economics, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; e-mail: roger.biv...@nhh.no https://orcid.org/0000-0003-2392-6140 https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en _______________________________________________ R-sig-Geo mailing list R-sig-Geo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-geo