Dear Roger, Thank you again for your answer. You're really helping me a lot.
I was wondering how to deal with the scales for lme4::lmer, e.g. I have the variable income_per_capita which has the income_per_capita for the zipcode in which the listing/ad is located. Is it possible to add an income_per_capita_zipcode and an income_per_capita_borough for the different scales/levels of the multilevel model? lme4::lmer(log_price ~ factor(room_type) + bedrooms + bathrooms + guests_included + minimum_nights + distance_downtown + distance_subway + number_of_reviews + review_scores_cleanliness + professional_host + host_is_superhost + is_business_travel_ready + offense_misdemeanor + offense_felony + income_per_capita + factor(date_compiled) + (1 | borough / zipcode / id), data = listings) Thank you and best regards ________________________________________ From: Roger Bivand <roger.biv...@nhh.no> Sent: Monday, December 9, 2019 12:14 To: Robert R Cc: r-sig-geo@r-project.org Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method On Sun, 8 Dec 2019, Robert R wrote: > Dear Roger, > > Thank you for your answer. Regarding the sparse matrix, you're right - I > tested creating one, as follows: > > ##### > > listings_nb <- listings %>% spdep::poly2nb(queen = TRUE) %>% > spdep::include.self() # is.symmetric.nb(listings_nb) Are listings point or polygon support (must be polygon for this function)? > > nb_B <- spdep::nb2listw(neighbours = listings_nb, style="B", zero.policy > = FALSE) > > B <- as(nb_B , "CsparseMatrix") > all(B == t(B)) > > nb_B1 <- mat2listw(as(B, "dgTMatrix")) > > format(object.size(nb_B), units = "Mb") > # 85.6 Mb > > format(object.size(nb_B1), units = "Mb") > # 85.6 Mb > > ##### > > The size for both objects is the same - different from this example: > https://cran.r-project.org/web/packages/spdep/vignettes/nb_igraph.html > > > Summing up the data that I have: a "picture" for Airbnb's listings/ads > once a month for NYC, incl. lat, lon, price (per night), id, and some > characteristics from the listing/ad such as number of rooms, bedrooms, guests > included, etc. for a period of 54 months. The data was taken here: > http://insideairbnb.com/get-the-data.html (listings.csv.gz) > > For the OLS, I used a pooled OLS with time dummy fixed effects > (date_compiled, when the "picture" was compiled - how Airbnb listings > for NYC were shown) because many of my observations (listings id) do not > repeat for many periods. Also, many listings changed the room_type at > least once during the whole time period analyzed (3 types of room_type: > entire home/apt, private room, shared room). > OK, I suppose. > I am now trying a random intercepts three-level hierarchy multilevel > model, where _id_ (level 1) is nested within _zipcode_ (level 2), and > the last is nested within _borough_ (level 3). So groups: > id:(zipcode:borough). > I'd try and see what happens, and watch machine resources (memory anyway). > lme4::lmer(log_price ~ factor(room_type) + bedrooms + bathrooms + > guests_included + minimum_nights + distance_downtown + distance_subway + > number_of_reviews + review_scores_cleanliness + professional_host + > host_is_superhost + is_business_travel_ready + offense_misdemeanor + > offense_felony + income_per_capita + factor(date_compiled) + (1 | > borough / zipcode / id), data = listings) > > Roger, do you think it is okay to use factor(room_type) (for the 3 types of > room) and factor(date_compiled) for the dates when the NYC Airbnb's > listings/ads were extracted? Do you see structured variability coming from these - if so, inclusion makes sense. 
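One way to set up the zipcode- and borough-scale income variables asked about above is to keep the zipcode-level value as its own column and add a borough-level aggregate alongside it; a minimal sketch, not from the thread - the new column names and the simple mean over the listings in each borough are assumptions:

#####

library(dplyr)
library(lme4)

# income_per_capita is already measured at the zipcode level, so the
# zipcode-scale covariate is just a copy; the borough-scale version here is
# the mean of the zipcode values over the listings in each borough
# (a population-weighted mean may be preferable)
listings <- listings %>%
  mutate(income_per_capita_zipcode = income_per_capita) %>%
  group_by(borough) %>%
  mutate(income_per_capita_borough = mean(income_per_capita, na.rm = TRUE)) %>%
  ungroup()

# both scale-specific covariates then enter the fixed part of the model
fit <- lmer(log_price ~ income_per_capita_zipcode + income_per_capita_borough +
              factor(room_type) + factor(date_compiled) +
              (1 | borough / zipcode / id),
            data = listings)

#####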
Roger > > Thank you and best regards, > Robert > > ________________________________________ > From: Roger Bivand <roger.biv...@nhh.no> > Sent: Wednesday, December 4, 2019 09:07 > To: Robert R > Cc: r-sig-geo@r-project.org > Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method > > On Wed, 4 Dec 2019, Robert R wrote: > >> Dear Roger, >> >> Again, thank you for your answer. What do you mean by "zip code random >> effect"? You mean I should use in plm the model "random"? >> >> regression_re <- plm(formula = model, data = listings, model = "random", >> index = c("id", "date_compiled")) > > No, obviously not, your data are not a balanced panel. I mean a multilevel > model, where the <200 zip codes cluster the data, and where a zip code > level IID RE will almost certainly do a better job than dummies. An > MRF/ICAR RE might be an extension. > >> >> And is there any other methodology for dealing with large weight matrices in >> spatialreg::lagsarlm? > > Please refer to Bivand et al. (2013) referred to in the package. Probably > the weights would need to be symmetric and very sparse. > > I still think that you should focus on a small subset of the data and on > improving the signal-noise ratio before trying to scale up. > > Roger > >> >> Thank you and best regards, >> Robert >> >> ________________________________________ >> From: Roger Bivand <roger.biv...@nhh.no> >> Sent: Wednesday, November 27, 2019 13:53 >> To: Robert R >> Cc: r-sig-geo@r-project.org >> Subject: SV: [R-sig-Geo] Spatial Autocorrelation Estimation Method >> >> Yes, this is expected, since the neighbours within a single zip code block form a >> dense matrix, and there will be multiple such matrices. (15000^2)*8 is >> 1.8e+09, so such a dense matrix will max out your RAM. There is no way to >> look at block neighbours in that format without subsetting your data (think >> train/test); use a zip code random effect instead. I would certainly drop all >> attempts to examine spatial dependency until you get an aspatial multilevel >> hedonic model working. >> >> Roger >> >> -- >> Roger Bivand >> Norwegian School of Economics >> Helleveien 30, 5045 Bergen, Norway >> roger.biv...@nhh.no >> >> >> ________________________________________ >> From: Robert R <userca...@outlook.com> >> Sent: Tuesday, November 26, 2019 21:04 >> To: Roger Bivand >> Cc: r-sig-geo@r-project.org >> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >> >> Dear Roger, >> >> Thank you for your e-mail. Actually there is less noise than it seems. >> Rental prices are daily rental prices and I have an extract of all Airbnb >> listings' daily prices once a month for a period of 4 years. Each listing's >> information contains the lat, lon, number of bedrooms, category (entire >> home/apt, shared room or private room), etc. >> >> One question regarding the spdep::nb2blocknb function: it runs super fast >> with up to n = 1000, and always crashes my R session with n = 15000 or so. >> Is there an alternative to solve this problem? >> >> Thank you and best regards, >> Robert >> >> ________________________________________ >> From: Roger Bivand <roger.biv...@nhh.no> >> Sent: Tuesday, November 26, 2019 20:48 >> To: Robert R >> Cc: r-sig-geo@r-project.org >> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >> >> Sorry for late reply, am indisposed and unable to help further. 
I feel >> that there is so much noise in your data (differences in offers, rental >> lengths, repeats or not, etc.), that you will certainly have to subset >> vigorously first to isolate response cases that are comparable. What you >> are trying to disentangle are the hedonic components in the bundle where >> you just have price as response, but lots of other bundle characteristics >> on the right hand side (days, etc.). I feel you'd need to try to get to a >> response index of price per day per rental area or some such. I'd >> certainly advise examining responses to a specific driver (major concert >> or sports event) to get a feel for how the market responds, and return to >> spatial hedonic after finding an approach that gives reasonable aspatial >> outcomes. >> >> Roger >> >> On Sun, 17 Nov 2019, Robert R wrote: >> >>> Dear Roger, >>> >>> Thank you for your message and sorry for my late answer. >>> >>> Regarding the number of listings (lettings) for my data set (2.216.642 >>> observations), each listing contains an individual id: >>> >>> unique ids: 180.004 >>> time periods: 54 (2015-01 to 2019-09) >>> number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%) >>> number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%) >>> number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) >>> (25,8%) >>> number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) >>> (12,5%) >>> >>> Important to note is that hosts can change the room_category (between >>> entire home/apt, private room and shared room) while keeping the same listing id >>> number. In my data, the number of unique ids that at some point changed the >>> room_type is 7.204. >>> >>> -- >>> >>> For the OLS model, I was using only a fixed effect model, where each time >>> period (date_compiled) (54 in total) is a time dummy. >>> >>> plm::plm(formula = model, data = listings, model = "pooling", index = >>> c("id", "date_compiled")) >>> >>> >>> -- >>> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial >>> fixed effects (SFE) hedonic model, where each defined neighborhood zone in >>> the study area is represented by dummy variables. >>> >>> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model >>> specifications to accommodate geographically hierarchical data structures: >>> (1) groupwise W and fixed regional effects; (2) groupwise W and random >>> regional effects; (3) proximity-based W and fixed regional effects; and (4) >>> proximity-based W and random regional effects. >>> -- >>> >>> I created a new column/variable containing the borough where the zipcode is >>> found (Manhattan, Brooklyn, Queens, Bronx, Staten Island). >>> >>> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous >>> Autoregressive Model (HSAR) considers the occurrence of spatial relations >>> at the (lower) individual (geographical coordinates - in my case, the >>> listing location) and (higher) group level (territorial units - in my case, >>> zipcodes). >>> >>> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. >>> The HSAR model may also be estimated without this component." So, in this >>> case I only estimate the Hierarchical Spatial Simultaneous Autoregressive >>> Model (HSAR) on a "one-level" basis, i.e., at the higher level. 
>>> >>> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = >>> 5000, Nsim = 10000, thinning = 1, parameters.start = pars) >>> >>> (Where the "model" formula contains the 54 time dummy variables) >>> >>> Do you think I can proceed with this model? I was able to calculate it. >>> >>> If I remove all observations/rows with NAs in one of the chosen >>> variables/columns, 884.183 observations remain. If I were to create a W >>> matrix for HSAR::hsar, I would have a gigantic 884.183 by 884.183 matrix. >>> This is the reason why I put W = NULL. >>> >>> >>> Thank you and best regards >>> >>> ________________________________________ >>> From: Roger Bivand <roger.biv...@nhh.no> >>> Sent: Monday, November 11, 2019 11:31 >>> To: Robert R >>> Cc: r-sig-geo@r-project.org >>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>> >>> On Sun, 10 Nov 2019, Robert R wrote: >>> >>>> Dear Roger, >>>> >>>> Again, thank you for your answer. I read the material provided and >>>> decided that the Hierarchical Spatial Autoregressive (HSAR) model could be the >>>> right model for me. >>>> >>>> I indeed have the precise latitude and longitude information for all my >>>> listings for NYC. >>>> >>>> I created a stratified sample (group = zipcode) with 22172 (1%) of my >>>> observations called listings_sample and tried to replicate the hsar >>>> model, please see below. >>>> >>>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix. >>> >>> Unless you know definitely that you want to relate the response to its >>> lagged value, you do not need this. Do note that the matrix is very >>> sparse, so it could be fitted without difficulty with ML in a cross-sectional >>> model. >>> >>>> >>>> You recommended then to introduce a Markov random field (MRF) random >>>> effect (RE) at the zipcode level, but I did not understand it so well. >>>> Could you elaborate a little more? >>>> >>> >>> Did you read the development in >>> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and >>> the paper includes code for fitting the Beijing housing parcels data set from HSAR >>> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should >>> try to create a model that works on a single borough, using the zipcodes >>> in that borough as a proxy for unobserved neighbourhood effects. Try for >>> example using lme4::lmer() with only a zipcode IID random effect, see if >>> the hedonic estimates are similar to lm(), and leave adding an MRF RE >>> (with for example mgcv::gam() or hglm::hglm()) until you have a working >>> testbed. Then advance step-by-step from there. >>> >>> You still have not said how many repeat lettings you see - it will affect >>> the way you specify your model. 
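The repeat-letting pattern asked about just above can be tabulated directly from the panel; a minimal sketch (illustrative only), assuming listings carries id and date_compiled columns:

#####

library(dplyr)

# number of monthly snapshots in which each listing id appears
id_repeats <- listings %>%
  distinct(id, date_compiled) %>%
  count(id, name = "n_periods")

# bucket the ids by how often they repeat (1, 2-10, 11-30, 31-54 periods)
id_repeats %>%
  mutate(bucket = cut(n_periods, breaks = c(0, 1, 10, 30, 54))) %>%
  count(bucket) %>%
  mutate(share = n / sum(n))

#####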
>>> >>> Roger >>> >>>> ############## >>>> library(spdep) >>>> library(HSAR) >>>> library(dplyr) >>>> library(splitstackshape) >>>> >>>> >>>> # Stratified sample per zipcode (size = 1%) listings_sample <- >>>> splitstackshape::stratified(indt = listings, group = "zipcode", size = >>>> 0.01) >>>> >>>> # Removing zipcodes from polygon_nyc which are not observable in >>>> listings_sample polygon_nyc_listings <- polygon_nyc %>% filter(zipcode >>>> %in% c(unique(as.character(listings_sample$zipcode)))) >>>> >>>> >>>> ## Random effect matrix (N by J) >>>> >>>> # N: 22172 >>>> # J: 154 >>>> >>>> # Arrange listings_sample by zipcode (ascending) >>>> listings_sample <- listings_sample %>% arrange(zipcode) >>>> >>>> # Count number of listings per zipcode >>>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% >>>> summarise(count = n()) %>% as.data.frame() >>>> # sum(MM$count) >>>> >>>> # N by J zero matrix creation >>>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1]) >>>> >>>> # Group (zipcode) index for each observation >>>> Uid <- rep(c(1:dim(MM)[1]), MM[,2]) >>>> >>>> for(i in 1:dim(MM)[1]) { >>>> Delta[Uid==i,i] <- 1 >>>> } >>>> rm(i) >>>> >>>> Delta <- as(Delta,"dgCMatrix") >>>> >>>> >>>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J) >>>> >>>> # Neighboring polygons: list of neighbors for each polygon (queen >>>> contiguity neighbors) >>>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = >>>> polygon_nyc_listings$zipcode, queen = TRUE) >>>> >>>> # Include neighbour itself as a neighbour >>>> polygon_nyc_nb <- include.self(polygon_nyc_nb) >>>> >>>> # Spatial weights matrix for nb >>>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", >>>> zero.policy = NULL) >>>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix") >>>> >>>> >>>> ## Fit HSAR SAR upper level random effect >>>> model <- as.formula(log_price ~ guests_included + minimum_nights) >>>> >>>> betas = coef(lm(formula = model, data = listings_sample)) >>>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = >>>> betas) >>>> >>>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = >>>> Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars) >>>> >>>> ############## >>>> >>>> Thank you and best regards >>>> Robert >>>> >>>> ________________________________________ >>>> From: Roger Bivand <roger.biv...@nhh.no> >>>> Sent: Friday, November 8, 2019 13:29 >>>> To: Robert R >>>> Cc: r-sig-geo@r-project.org >>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>>> >>>> On Fri, 8 Nov 2019, Robert R wrote: >>>> >>>>> Dear Roger, >>>>> >>>>> Thank you for your answer. >>>>> >>>>> I successfully used the function nb2blocknb() for a smaller dataset. >>>>> >>>>> But for a dataset of over 2 million observations, I get the following >>>>> error: "Error: cannot allocate vector of size 840 Kb". >>>> >>>> I don't think the observations are helpful. If you have repeat lets in the >>>> same property in a given month, you need to handle that anyway. I'd go for >>>> making the modelling exercise work (we agree that this is not panel data, >>>> right?) on a small subset first. I would further argue that you need a >>>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID >>>> RE. You could very well take (stratified) samples per zipcode to represent >>>> your data. Once that works, introduce an MRF RE at the zipcode level, >>>> where you do know relative position. 
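Once the IID testbed works, the MRF RE step suggested just above could be tried with mgcv::gam(), for example; a minimal sketch (illustrative, not the thread's code), assuming zipcode is a factor in listings_sample and that the neighbour list is named so its elements match the factor levels:

#####

library(mgcv)
library(spdep)

# zipcode as a factor, dropping levels not present in the sample
listings_sample$zipcode <- droplevels(factor(listings_sample$zipcode))

# queen contiguity neighbours WITHOUT self-neighbours for the MRF penalty,
# named by zipcode so mgcv can match them to the factor levels
nb_mrf <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode,
                  queen = TRUE)
names(nb_mrf) <- attr(nb_mrf, "region.id")

# structured (MRF) plus unstructured (IID) zipcode random effects
m_gam <- gam(log_price ~ guests_included + minimum_nights +
               s(zipcode, bs = "mrf", xt = list(nb = nb_mrf)) +
               s(zipcode, bs = "re"),
             data = listings_sample, method = "REML")

#####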
Using SARAR is going to be a waste of >>>> time unless you can geocode the letting addresses. A multi-level approach >>>> will work. Having big data in your case with no useful location >>>> information per observation is just adding noise and over-smoothing, I'm >>>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002 >>>> will work, also when you sample the within-zipcode lets, given a split >>>> into training and test sets, and making CV possible. >>>> >>>> Roger >>>> >>>>> >>>>> I am expecting that at least 500.000 observations will be dropped due >>>>> to the lack of values for the chosen variables for the regression model, so >>>>> probably I will filter and remove the observations/rows that will not be >>>>> used anyway - do you know if there is any package that does this >>>>> automatically, given the variables/columns chosen by me? >>>>> >>>>> Or would you recommend another approach to avoid the above-mentioned >>>>> error? >>>>> >>>>> Thank you and best regards, >>>>> Robert >>>>> >>>>> ________________________________________ >>>>> From: Roger Bivand <roger.biv...@nhh.no> >>>>> Sent: Thursday, November 7, 2019 10:13 >>>>> To: Robert R >>>>> Cc: r-sig-geo@r-project.org >>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>>>> >>>>> On Thu, 7 Nov 2019, Robert R wrote: >>>>> >>>>>> Dear Roger, >>>>>> >>>>>> Many thanks for your help. >>>>>> >>>>>> I have an additional question: >>>>>> >>>>>> Is it possible to create a "separate" lw (nb2listw) (with different >>>>>> rownumbers) from my data set? For now, I am taking my data set and >>>>>> merging with the sf object polygon_nyc with the function >>>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create >>>>>> a huge n x n matrix (depending on the size of my data set). >>>>>> >>>>>> Taking the polygon_nyc alone and turning it into a lw (weights list) >>>>>> object gives only n = 177. >>>>>> >>>>>> Of course running >>>>>> >>>>>> spatialreg::lagsarlm(formula=model, data = listings_sample, >>>>>> listw = polygon_nyc_lw, tol.solve=1.0e-10) >>>>>> >>>>>> does not work ("Input data and weights have different dimensions"). >>>>>> >>>>>> Is the only option to take my data set, merge it with my polygon_nyc (by >>>>>> zipcode) and then create the weights list lw? Or is there another option? >>>>> >>>>> I think we are getting more clarity. You do not know the location of the >>>>> lettings beyond their zipcode. You do know the boundaries of the zipcode >>>>> areas, and can create a neighbour object from these boundaries. You then >>>>> want to treat all the lettings in a zipcode area i as neighbours, and >>>>> additionally lettings in zipcode areas neighbouring i as neighbours of >>>>> lettings in i. This is the data structure that motivated the >>>>> spdep::nb2blocknb() function: >>>>> >>>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html >>>>> >>>>> Try running the examples to get a feel for what is going on. >>>>> >>>>> I feel that most of the variability will vanish in the very large numbers >>>>> of neighbours, over-smoothing the outcomes. If you do not have locations >>>>> for the lettings themselves, I don't think you can make much progress. >>>>> >>>>> You could try a linear mixed model (or gam with a spatially structured >>>>> random effect) with a temporal and a spatial random effect. See the HSAR >>>>> package, articles by Dong et al., and maybe >>>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither >>>>> this nor Dong et al. 
handle spatio-temporal settings. MRF spatial random >>>>> effects at the zipcode level might be a way forward, together with an IID >>>>> random effect at the same level (equivalent to self-neighbours). >>>>> >>>>> Hope this helps, >>>>> >>>>> Roger >>>>> >>>>>> >>>>>> Best regards, >>>>>> Robert >>>>>> >>>>>> ________________________________________ >>>>>> From: Roger Bivand <roger.biv...@nhh.no> >>>>>> Sent: Wednesday, November 6, 2019 15:07 >>>>>> To: Robert R >>>>>> Cc: r-sig-geo@r-project.org >>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>>>>> >>>>>> On Tue, 5 Nov 2019, Robert R wrote: >>>>>> >>>>>>> Dear Roger, >>>>>>> >>>>>>> Thank you for your reply. I disabled HTML; my e-mails should now be in >>>>>>> plain text. >>>>>>> >>>>>>> I will give a better context for my desired outcome. >>>>>>> >>>>>>> I am taking Airbnb's listings information for New York City available >>>>>>> on: http://insideairbnb.com/get-the-data.html >>>>>>> >>>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09) >>>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a >>>>>>> Listings/ folder. When importing all these 54 files into one single data >>>>>>> set, I create a new "date_compiled" variable/column. >>>>>>> >>>>>>> In total, after the data cleansing process, I have a little more than 2 >>>>>>> million observations. >>>>>> >>>>>> You have repeat lettings for some, but not all properties. So this is at >>>>>> best a very unbalanced panel. For those properties with repeats, you may >>>>>> see temporal movement (trend/seasonal). >>>>>> >>>>>> I suggest (strongly) taking a single borough or even zipcode with some >>>>>> hundreds of properties, and working from there. Do not include the >>>>>> observation as its own neighbour, perhaps identify repeats and handle >>>>>> them >>>>>> specially (create or use a property ID). Unbalanced panels may also >>>>>> create >>>>>> a selection bias issue (why are some properties only listed sometimes?). >>>>>> >>>>>> So this, although promising, isn't simple, and getting to a hedonic model >>>>>> may be hard, but not (just) because of spatial autocorrelation. I >>>>>> wouldn't >>>>>> necessarily trust OLS output either, partly because of the repeat >>>>>> property >>>>>> issue. >>>>>> >>>>>> Roger >>>>>> >>>>>>> >>>>>>> I created 54 timedummy variables, one for each time period available. >>>>>>> >>>>>>> I want to estimate, using a hedonic spatial timedummy model, the impact of >>>>>>> a variety of characteristics which potentially determine the daily rate >>>>>>> of Airbnb listings through time in New York City (e.g. characteristics >>>>>>> of the listing such as number of bedrooms, whether the host is professional, >>>>>>> proximity to downtown (New York City Hall) and to the nearest subway station >>>>>>> from the listing, income per capita, etc.). >>>>>>> >>>>>>> My dependent variable is price (log price, common in the related >>>>>>> literature for hedonic prices). >>>>>>> >>>>>>> The OLS model is done. >>>>>>> >>>>>>> For the spatial model, I am assuming that hosts, when deciding the >>>>>>> pricing of their listings, take into account not only their structural and >>>>>>> location characteristics, but also the prices charged by nearby listings >>>>>>> with similar characteristics - spatial autocorrelation is then present; >>>>>>> at least, spatial dependence is present in the dependent variable. >>>>>>> >>>>>>> As I wrote in my previous post, I was willing to consider each observation >>>>>>> itself as a neighbor. 
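On the earlier question about dropping rows with missing values in the chosen model variables: no dedicated package is needed - lm() and plm() already omit incomplete rows via na.action, and complete.cases() or tidyr::drop_na() do it explicitly; a minimal sketch with an illustrative set of variable names:

#####

library(dplyr)
library(tidyr)

# illustrative vector of model variables - replace with the columns actually used
model_vars <- c("log_price", "guests_included", "minimum_nights", "zipcode")

# base R: keep only the rows complete on those columns
listings_cc <- listings[complete.cases(listings[, model_vars]), ]

# tidyverse equivalent
listings_cc <- listings %>% drop_na(all_of(model_vars))

#####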
>>>>>>> >>>>>>> Parts of my code can be found below: >>>>>>> >>>>>>> ######## >>>>>>> >>>>>>> ## packages >>>>>>> >>>>>>> packages_install <- function(packages){ >>>>>>> new.packages <- packages[!(packages %in% installed.packages()[, >>>>>>> "Package"])] >>>>>>> if (length(new.packages)) >>>>>>> install.packages(new.packages, dependencies = TRUE) >>>>>>> sapply(packages, require, character.only = TRUE) >>>>>>> } >>>>>>> >>>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", >>>>>>> "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", >>>>>>> "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", >>>>>>> "scales", "sf", "spdep", "stargazer", "tidyverse") >>>>>>> packages_install(packages_required) >>>>>>> >>>>>>> # Working directory >>>>>>> setwd("C:/Users/User/R") >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## shapefile_us >>>>>>> >>>>>>> # Shapefile zips import and Coordinate Reference System (CRS) >>>>>>> transformation >>>>>>> # Shapefile download: >>>>>>> https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip >>>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = >>>>>>> "cb_2018_us_zcta510_500k") >>>>>>> >>>>>>> # Columns removal >>>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, >>>>>>> ALAND10, AWATER10)) >>>>>>> >>>>>>> # Column rename: ZCTA5CE10 >>>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode")) >>>>>>> >>>>>>> # Column class change: zipcode >>>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## polygon_nyc >>>>>>> >>>>>>> # Zip code not available in shapefile: 11695 >>>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## weight_matrix >>>>>>> >>>>>>> # Neighboring polygons: list of neighbors for each polygon (queen >>>>>>> contiguity neighbors) >>>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), >>>>>>> queen=TRUE) >>>>>>> >>>>>>> # Include neighbour itself as a neighbour >>>>>>> # for(i in >>>>>>> 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))} >>>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb) >>>>>>> >>>>>>> # Weights to each neighboring polygon >>>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## listings >>>>>>> >>>>>>> # Data import >>>>>>> files <- list.files(path="Listings/", pattern=".csv", full.names=TRUE) >>>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, >>>>>>> stringsAsFactors = FALSE, encoding="UTF-8")), files) >>>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings)) >>>>>>> listings <- listings %>% bind_rows >>>>>>> >>>>>>> # Characters removal >>>>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled) >>>>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled) >>>>>>> listings$price <- gsub("\\$", "", listings$price) >>>>>>> listings$price <- gsub(",", "", listings$price) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## timedummy >>>>>>> >>>>>>> timedummy <- sapply("date_compiled_", paste, >>>>>>> unique(listings$date_compiled), sep="") >>>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ") >>>>>>> timedummy <- gsub("-", "_", timedummy) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ## OLS regression >>>>>>> >>>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb >>>>>>> listings price at different points in time >>>>>>> regression <- 
plm(formula=as.formula(paste("log_price ~ #some >>>>>>> variables", timedummy, sep = "", collapse = " + ")), data=listings, >>>>>>> model="pooling", index="id") >>>>>>> >>>>>>> ######## >>>>>>> >>>>>>> Some of my ids repeat in multiple time periods. >>>>>>> >>>>>>> I use NYC's zip codes to left join my data with the neighborhood zip >>>>>>> code specific characteristics, such as income per capita for that >>>>>>> specific zip code, etc. >>>>>>> >>>>>>> Now I want to apply the hedonic model with the timedummy variables. >>>>>>> >>>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) >>>>>>> Do I have to join the polygon_nyc (by zip code) to my listings data >>>>>>> set, and then calculate the weight matrix "lw"? >>>>>>> >>>>>>> Again, thank you very much for the help provided until now. >>>>>>> >>>>>>> Best regards, >>>>>>> Robert >>>>>>> >>>>>>> ________________________________________ >>>>>>> From: Roger Bivand <roger.biv...@nhh.no> >>>>>>> Sent: Tuesday, November 5, 2019 15:30 >>>>>>> To: Robert R >>>>>>> Cc: r-sig-geo@r-project.org >>>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method >>>>>>> >>>>>>> On Tue, 5 Nov 2019, Robert R wrote: >>>>>>> >>>>>>>> I have a large pooled cross-section data set. I would like to >>>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming >>>>>>>> for now that spatial dependence is present in both the dependent >>>>>>>> variable and the error term. My data set is over a period of 4 years, >>>>>>>> monthly data (54 periods). For this purpose, I've created a time dummy >>>>>>>> variable for each time period. I also created a weight matrix using the >>>>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way >>>>>>>> to estimate my model which contains a really big data set. Basically, my >>>>>>>> model is as follows: y = τD + ρW1y + Xβ + λW2u + ε. My questions are: 1) >>>>>>>> My spatial weight matrix for the whole data set will probably be an >>>>>>>> enormous matrix with submatrices for each time period itself. I don't >>>>>>>> think it would be possible to calculate this. What I would like to know >>>>>>>> is a way to estimate each time dummy/period separately (to compare >>>>>>>> different periods alone). How to do it? 2) Which package to use: spdep >>>>>>>> or splm? Thank you and best regards, Robert >>>>>>> >>>>>>> Please do not post HTML, only plain text. Almost certainly your model >>>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are >>>>>>> untried). What is your cross-sectional size? Using sparse kronecker >>>>>>> products, the "enormous" matrix may not be very big. Does it make any >>>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most >>>>>>> of the covariates time-varying? Please provide motivation and use area >>>>>>> (preferably with affiliation; your email and user name are not >>>>>>> informative) - this feels like a real estate problem, probably wrongly >>>>>>> specified. You should use splm if time makes sense in your case, but if it >>>>>>> really doesn't, simplify your approach, as much of the data will be >>>>>>> subject to very large temporal autocorrelation. >>>>>>> >>>>>>> If this is a continuation of your previous question about using >>>>>>> self-neighbours, be aware that you should not use self-neighbours in >>>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure. 
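A minimal sketch of the sparse Kronecker product idea mentioned above (illustrative only; whether a panel-wide W is needed at all is the separate question raised in the reply), assuming polygon_nyc_nb is the cross-sectional neighbour object and 54 monthly periods:

#####

library(Matrix)
library(spdep)

# cross-sectional weights as a sparse matrix (the coercion used earlier in the thread)
W <- as(nb2listw(polygon_nyc_nb, style = "W", zero.policy = TRUE), "CsparseMatrix")

# block-diagonal panel weights: identity over the T periods kronecker the N x N W;
# the (T*N) x (T*N) result stays sparse, so it need not be "enormous" in memory
T_periods <- 54
W_panel <- Diagonal(T_periods) %x% W
format(object.size(W_panel), units = "Mb")

#####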
>>>>>>> Roger -- Roger Bivand Department of Economics, Norwegian School of Economics, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; e-mail: roger.biv...@nhh.no https://orcid.org/0000-0003-2392-6140 https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en _______________________________________________ R-sig-Geo mailing list R-sig-Geo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-geo