On Sun, 14 May 2017, Tiernan Martin wrote:
You’re right about one thing: those of us introduced to the pipe in our
formative years find it hard to escape (everything is better here inside
the pipe - right?) ;)
Pipes can be great if they don't involve extra inputs at a later stage in
the pipe, but as:
https://github.com/thomasp85/tidygraph
suggests, it isn't very easy, and may be forced.
Thank you for your very informative response. Before reading it, I had a
vague inkling that matrices were part of the challenge of adapting these
tools to work with simple features, but I also thought perhaps all this
issue needed was a little nudge from a user like me. Your response made
it
clear that there is a series of technical decisions that need to be made,
and in order to do that I imagine some exploratory testing is in order.
Of
course, it is possible that the data.frame + list-col foundation is just
not well-suited to the needs of spatial weighting/testing tools, but I
suspect that there is some valuable insight to be gained from exploring
this possibility further. And as I stated in my first post, going
forward I
expect more users will be interested in seeing this sort of integration
happen.
I think that there are two tasks - one to create neighbour objects from sf
objects, then another to see whether the costs of putting the spatial
weights into a data.frame (or even more challenging, a tibble) to test for
spatial autocorrelation or model spatial dependence. Should the weights be
squirreled away inside the object flowing down the pipe?
sf_object %>% add_weights(...) -> sf_object_with_weights
moran.test(sf_object_with_weights, ...)
geary.test(sf_object_with_weights, ...)
or
sf_object %>% add_weights(...) %>% moran.test(sf_object_with_weights, ...)
-> output
all in one (output is a htest object). It gets more complex in:
lm.morantest(lm(formula, sf_object), sf_object_with_weights)
as we need the weights and the lm output object.
How should we go about this technically? spdep is on R-Forge and is
happy
there. Its present functionality has to be maintained as it stands, it
has
too many reverse dependencies to break. Should it be re-written from
scratch (with more sparse matrices internally for computation)?
...
We'd need proof of concept with realistically sized data sets (not yet
NY
taxis, but maybe later ...). spdep started as spweights, sptests and
spdep, and the first two got folded into the third when stable. If
weights
are the first thing to go for, sfweights is where to go first (and port
the STRtree envelope intersections for contiguities). It could define
the
new classes, and tests and modelling would use them.
Sounds to me like this package would need to be built from the ground up,
possibly following a similar path to the `sp` development process as you
mentioned. Maybe that presents a challenge with regards to resources
(i.e.,
funding, time, etc.). Perhaps project this is a candidate for future
proposals to the R Consortium or other funding sources. I hope that this
post is useful in documenting user interest in seeing the `sf` protocol
integrated into other spatial tools and workflows, and I look forward to
hearing others' thoughts on the matter.
Making spatial weights from sf objects is just expanding spdep, and is
feasible. There is no point looking for funding, all it needs is knowledge
of how things have been done in spdep. Contiguity, knn, distance,
graph-based neighbours are all feasible.
Extending this to adding weights to sf objects may be going too far. It
seems slightly forced to - say - attach a sparse matrix to an sf geometry
column as an attribute, or to add a neighbour nested list column when
writing subsetting protection is very far from obvious. Like time series
data (not time stamped observations, but real ordered time series),
spatial data have implicit order that is tidy in a different way than tidy
(graph dependencies between observation rows). Maybe mapped features with
attached attribute data are "tidier" than a table, because they would
preserve their relative positions?
Roger
Thanks again,
Tiernan
On Sat, May 13, 2017 at 1:39 PM Roger Bivand <roger.biv...@nhh.no>
wrote:
On Fri, 12 May 2017, Tiernan Martin wrote:
Is anyone thinking about creating an adaptation of the `spdep` package
that
expects sf-class inputs and works well in a pipeline?
I assume that "this is not a pipeline" is a joke for insiders (in the
pipe)?
No, there is your issue on the sfr github repository that is relevant
for
contiguous neighbours, but not beyond that:
https://github.com/edzer/sfr/issues/234
An sf is a data.frame, and as such should "just work", like "Spatial"
objects have, in formula/data settings. The problem is (of course) the
weights matrix.
Should it be a list column (each row has a list nesting two lists, first
indices, second non-zero weights), or remain separate as it has been for
20 years, or become a column-oriented representation (Matrix package) -
a
nested list like a list column or a listw obeject is row-oriented. I had
started thinking about using a sparse column-oriented representation,
but
have not implemented functions accepting them instead of listw objects.
I am very cautious about creating classes for data that were data.frame,
then sf, and then have the weights built-in. In the simple case it would
work, but all you have to do is re-order the rows and the link between
the
neighbour ids and row order breaks down; the same applies to subsetting.
The problems to solve first are related the workflows, and easiest to
look
at in the univariate case (Moran, Geary, join-count, Mantel, G, ...) for
global and local tests. I think that all or almost all of the NSE verbs
will cause chaos (mutate, select, group) once weights have been created.
If there is a way to bind the neighour IDs to the input sf object rows,
a
list column might be possible, but we have to permit multiple such
columns
(Queen, Rook, ...), and ensure that subsetting and row-reordering keep
the
graph relations intact (and their modifications, like
row-standardisation)
If you've seen any examples of how tidy relates to graphs (adjacency
lists
are like nb objects), we could learn from that.
How should we go about this technically? spdep is on R-Forge and is
happy
there. Its present functionality has to be maintained as it stands, it
has
too many reverse dependencies to break. Should it be re-written from
scratch (with more sparse matrices internally for computation)?
Your example creates weights on the fly for a local G* map, but has
n=100,
not say n=90000 (LA census blocks). Using the sf_ geos based methods
does
not use STRtrees, which cut the time needed to find contigious
neighbours
from days to seconds. We ought to pre-compute weights, but this messes
up
the flow of data, because a second lot of data (the weights) have to
enter
the pipe, and be matched with the row-ids.
We'd need proof of concept with realistically sized data sets (not yet
NY
taxis, but maybe later ...). spdep started as spweights, sptests and
spdep, and the first two got folded into the third when stable. If
weights
are the first thing to go for, sfweights is where to go first (and port
the STRtree envelope intersections for contiguities). It could define
the
new classes, and tests and modelling would use them.
Thanks for starting this discussion.
Roger
I understand that there is skepticism about the wisdom of adopting the
“tidyverse” principles throughout the R package ecosystem, and I share
the
concern that an over-reliance on any single paradigm could reduce the
resilience and diversity of the system as a whole.
That said, I believe that the enthusiastic adoption of the `sf` package
and
the package's connections with widely-used tidyverse packages like
`dplyr`
and `ggplot2` may result in increased demand for sf-friendly spatial
analysis tools. As an amateur who recently started using R as my
primary
GIS tool, it seems like the tidyverse's preference for dataframes, S3
objects, list columns, and pipeline workflows would be well-suited to
the
field of spatial analysis. Are there some fundamental reasons why the
`spdep` tools cannot (or should not) be adapted to the tidyverse
"dialect"?
Let me put the question in the context of an actual analysis: in
February
2017, the pop culture infovis website The Pudding (
https://pudding.cool/
)
published an analysis of regional preferences for Oscar-nominated films
in
the US (https://pudding.cool/2017/02/oscars_so_mapped/). A few days
ago,
the author posted a tutorial explaining the method of “regional
smoothing”
used to create the article’s choropleths (
https://pudding.cool/process/regional_smoothing/).
The method relies on several `spdep` functions (
https://github.com/polygraph-cool/smoothing_tutorial/blob/master/smoothing_tutorial.R
).
In the code below, I provide reprex with a smaller dataset included in
the
`sf` package:
library(sf)
library(spdep)
nc <- st_read(system.file("shape/nc.shp", package = "sf")) # North
Carolina counties
nc_shp <- as(nc,'Spatial')
coords <- coordinates(nc_shp)
IDs<-row.names(as(nc_shp, "data.frame"))
knn5 <- knn2nb(knearneigh(coords, k = 5), row.names = IDs) # find the
nearest neighbors for each county
knn5 <- include.self(knn5)
localGvalues <- localG(x = as.numeric(nc_shp@data$NWBIR74), listw =
nb2listw(knn5, style = "B"), zero.policy = TRUE) # calculate the G
scores
localGvalues <- round(localGvalues,3)
nc_shp@data$LOCAL_G <- as.numeric(localGvalues)
p1 <- spplot(nc_shp, c('NWBIR74'))
p2 <- spplot(nc_shp, c('LOCAL_G'))
plot(p1, split=c(1,1,2,2), more=TRUE)
plot(p2, split=c(1,2,2,2), more=TRUE)
Here’s what I imagine that would look like in a tidyverse pipeline
(please
note that this code is for illustrative purposes and will not run):
library(tidyverse)
library(purrr)
library(sf)
library(sfdep) # this package doesn't exist (yet)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nc_g <-
nc %>%
mutate(KNN = map(.x = geometry, ~ sfdep::st_knn(.x, k = 5,
include.self
=
TRUE)), # find the nearest neighbors for each county
NB_LIST = map(.x = KNN, ~ sfdep::st_nb_list(.x, style =
'B')), #
make a list of the neighbors using the binary method
LOCAL_G = sfdep::st_localG(x = NWBIR74, listw = NB_LIST,
zero.policy = TRUE), # calculate the G scores
LOCAL_G = round(LOCAL_G,3))
We can see that the (hypothetical) tidyverse version reduces the amount
of
intermediate objects and wraps the creation of the G scores into a
single
code chunk with clear steps.
I'd be grateful to hear from the users and developers of the `spdep`
and
`sf` packages about this topic!
Tiernan Martin
[[alternative HTML version deleted]]
_______________________________________________
R-sig-Geo mailing list
R-sig-Geo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55 <+47%2055%2095%2093%2055>
<+47%2055%2095%2093%2055>; e-mail:
roger.biv...@nhh.no
Editor-in-Chief of The R Journal,
https://journal.r-project.org/index.html
http://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55 <+47%2055%2095%2093%2055>; e-mail:
roger.biv...@nhh.no
Editor-in-Chief of The R Journal, https://journal.r-project.org/index.html
http://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en