Re: [R] Transport and Earth Mover's Distance

2017-03-09 Thread Schuhmacher, Dominic

> Am 08.03.2017 um 11:28 schrieb Schuhmacher, Dominic 
> :
> 
> ...
>>> 
>>> If you have no particular need for binning, check out the function
>>> pppdist in the R-package spatstat, which offers a more flexible way
>>> to deal with point patterns of different size.
>> 
>> 
>> Well, this is not clear, but possibly very important for me.
>> My raw data consists of 2 univariate samples of unequal length.
>> 
>> suppose that
>> 
>> x<-rnorm(100)
>> 
>> and
>> 
>> y<-rnorm(90)
>> 
>> Is there a way to define the Wasserstein distance between them without
>> going through the binning procedure?
>> 
> Define, yes: the 1-Wasserstein distance in one-dimension is the area between 
> the empirical cumulative distribution functions. If the samples had the same 
> lengths this could be directly computed by
> 
> mean(abs(sort(x)-sort(y)))
> 
> Otherwise this needs some lines of code. I will include it in the next 
> version of the transport package (soon).
> 
> Best regards,
> Dominic
> 
> 
Following up on this earlier post: transport 0.8-2, which is on CRAN now, 
offers the possibility to compute the Wasserstein distance between univariate 
samples of differing lengths (more precisely their empirical distributions).

library(transport)
x <- rnorm(100)
y <- rnorm(90)
wasserstein1d(x,y) 

Cheers, Dominic



Dominic Schuhmacher
Professor of Stochastics
University of Goettingen
http://www.dominic.schuhmacher.name

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Transport and Earth Mover's Distance

2017-03-08 Thread Schuhmacher, Dominic
Dear Lorenzo,

No, the code does not do what you are after. R-package transport is for point 
patterns and histograms in two and more dimensions. You have a distribution in 
one dimension.

> 1) two distributions with the same bins (I identify each bin by the
> central point in the bin).
> 
> n_bin <- 11 # number of bins
> 
> bin_structure <- seq(10, by=1, len=n_bin)
> 
> set.seed(1234)
> 
> x_counts <- rpois(n_bin, 10)
> y_counts <- rpois(n_bin, 10)
> 
> x <- pp(as.matrix(cbind(bin_structure, x_counts)))
# 11 points in two dimensions with x-coordinates bin_structure and 
y-coordinates x_counts 
> y <- pp(as.matrix(cbind(bin_structure, y_counts)))
# 11 points in two dimensions with x-coordinates bin_structure and 
y-coordinates y_counts 
> match <- transport(x,y,p=1)
# compute the optimal transport between the two point patterns (assuming mass 1 
at each point), i.e. simply match the points in 2-d
> plot(x,y,match)
> wasserstein_dist <- wasserstein(x,y,p=1,match)

# the average distance by which points are moved.

You could trick transport into solving one-dimensional problems. But given that 
one-dimensional optimal transport is computationally simple, this would be 
highly inefficient. Depending on what you want to do with „additional mass“ (in 
your examples, you have count data with differing total counts), the following 
might be what you want:

x_prob <- x_counts/sum(x_counts)
y_prob <- y_counts/sum(y_counts)
sum(abs(cumsum(x_prob)-cumsum(y_prob)))

This gives the 1-Wasserstein distance between the probability histograms (i.e. 
the empirical distributions), assuming as in your example equally spaced-bins 
of width 1. You need the same bins for both samples, but you can treat your 
Example 2 by adding bins with count 0 to x_counts and y_counts.




>> 
>> If you have no particular need for binning, check out the function
>> pppdist in the R-package spatstat, which offers a more flexible way
>> to deal with point patterns of different size.
> 
> 
> Well, this is not clear, but possibly very important for me.
> My raw data consists of 2 univariate samples of unequal length.
> 
> suppose that
> 
> x<-rnorm(100)
> 
> and
> 
> y<-rnorm(90)
> 
> Is there a way to define the Wasserstein distance between them without
> going through the binning procedure?
> 
Define, yes: the 1-Wasserstein distance in one-dimension is the area between 
the empirical cumulative distribution functions. If the samples had the same 
lengths this could be directly computed by

mean(abs(sort(x)-sort(y)))

Otherwise this needs some lines of code. I will include it in the next version 
of the transport package (soon).

Best regards,
Dominic



> Am 07.03.2017 um 16:32 schrieb Lorenzo Isella :
> 
> Dear Dominic,
> Thanks a lot for the quick reply.
> Just a few questions to make sure I got it all right (I now understand that
> transport and spatstat in particular can do much more than I need
> right now).
> Essentially I am after the Wasserstein distance between univariate
> distributions (and it would be great if I can extend it to the
> case of two distributions with a different bin structure).
> 
> 1) two distributions with the same bins (I identify each bin by the
> central point in the bin).
> 
> n_bin <- 11 # number of bins
> 
> bin_structure <- seq(10, by=1, len=n_bin)
> 
> set.seed(1234)
> 
> x_counts <- rpois(n_bin, 10)
> y_counts <- rpois(n_bin, 10)
> 
> x <- pp(as.matrix(cbind(bin_structure, x_counts)))
> y <- pp(as.matrix(cbind(bin_structure, y_counts)))
> 
> 
> match <- transport(x,y,p=1)
> plot(x,y,match)
> wasserstein_dist <- wasserstein(x,y,p=1,match)
> 
> 
> 2) Now I do not have the same bin structure
> 
> 
> y2 <- pp(as.matrix(cbind(bin_structure+2, y_counts)))
> 
> 
> match <- transport(x,y2,p=1)
> plot(x,y2,match)
> wasserstein_dist2 <- wasserstein(x,y2,p=1,match)
> 
> 
> Do 1) and 2) make sense?
> 
>> 
>> If you have no particular need for binning, check out the function
>> pppdist in the R-package spatstat, which offers a more flexible way
>> to deal with point patterns of different size.
> 
> 
> Well, this is not clear, but possibly very important for me.
> My raw data consists of 2 univariate samples of unequal length.
> 
> suppose that
> 
> x<-rnorm(100)
> 
> and
> 
> y<-rnorm(90)
> 
> Is there a way to define the Wasserstein distance between them without
> going through the binning procedure?
> 
> 
> 
> Many thanks!
> 
> Lorenzo


Prof. Dr. Dominic Schuhmacher
Institute for Mathematical Stochastics
University of Goettingen
Goldschmidtstrasse 7
D-37077 Goettingen
Germany

Phone: +49 (0)551 39172107
E-mail: dominic.schuhmac...@mathematik.uni-goettingen.de



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible 

Re: [R] Transport and Earth Mover's Distance

2017-03-07 Thread Lorenzo Isella

Dear Dominic,
Thanks a lot for the quick reply.
Just a few questions to make sure I got it all right (I now understand that
transport and spatstat in particular can do much more than I need
right now).
Essentially I am after the Wasserstein distance between univariate
distributions (and it would be great if I can extend it to the
case of two distributions with a different bin structure).

1) two distributions with the same bins (I identify each bin by the
central point in the bin).

n_bin <- 11 # number of bins

bin_structure <- seq(10, by=1, len=n_bin)

set.seed(1234)

x_counts <- rpois(n_bin, 10)
y_counts <- rpois(n_bin, 10)

x <- pp(as.matrix(cbind(bin_structure, x_counts)))
y <- pp(as.matrix(cbind(bin_structure, y_counts)))


match <- transport(x,y,p=1)
plot(x,y,match)
wasserstein_dist <- wasserstein(x,y,p=1,match)


2) Now I do not have the same bin structure


y2 <- pp(as.matrix(cbind(bin_structure+2, y_counts)))


match <- transport(x,y2,p=1)
plot(x,y2,match)
wasserstein_dist2 <- wasserstein(x,y2,p=1,match)


Do 1) and 2) make sense?



If you have no particular need for binning, check out the function
pppdist in the R-package spatstat, which offers a more flexible way
to deal with point patterns of different size.



Well, this is not clear, but possibly very important for me.
My raw data consists of 2 univariate samples of unequal length.

suppose that

x<-rnorm(100)

and

y<-rnorm(90)

Is there a way to define the Wasserstein distance between them without
going through the binning procedure?



Many thanks!

Lorenzo

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Transport and Earth Mover's Distance

2017-03-07 Thread Schuhmacher, Dominic
Dear Lorenzo,

The code you have attached (if you replace 00 by 50) generates two sets of 50 
points each that are uniformly distributed on the unit square [0,1]^2 and then 
computes the Wasserstein-1 distance between them assuming that each point has 
mass 1/50. Do the following to get a plot of the optimal point assignment (the 
average of the line lengths is the Wasserstein-1 distance).

x <- pp(matrix(runif(100),50,2))
y <- pp(matrix(runif(100),50,2))
match <- transport(x,y,p=1)
plot(x,y,match)
wasserstein(x,y,p=1,match)

Currently the transport package only matches "objects" with the same total 
mass, i.e. point patterns (class pp) with equal numbers of points, weighted 
point patterns (class wpp) where e.g. the mass at each point is given by 1/(no. 
of points), and pixel grids (class pgrid) which are supposed to have the same 
total mass summed over all pixel values.

If you have no particular need for binning, check out the function pppdist in 
the R-package spatstat, which offers a more flexible way to deal with point 
patterns of different size. If you need to bin, check out transport.pgrid, but 
be aware that it does not do "incomplete transport", but will normalize both 
pixel grids to unit total mass if these masses are different.

Best regards,
Dominic



> Am 07.03.2017 um 13:29 schrieb Lorenzo Isella :
> 
> Dear All,
> From time to time I need to resort to the calculation of the earth
> mover' distance (see
> 
> https://en.wikipedia.org/wiki/Earth_mover's_distance and
> https://en.wikipedia.org/wiki/Wasserstein_metric .
> 
> In the past I used the package
> 
> https://r-forge.r-project.org/projects/earthmovdist/
> 
> which apparently is no longer available, but there is plenty of choice
> in R.
> 
> From the transport package, I found this example
> 
> set.seed(27)
> x <- pp(matrix(runif(100),50,2))
> y <- pp(matrix(runif(100),00,2))
> wasserstein(x,y,p=1)
> 
> but it is not 100% clear to me how to interpret it.
> Are x and y meant as histograms where the the center of each bin is
> provided and the total mass in the bins is automatically normalized to
> 1?
> 
> Essentially, my situation is that I have two univariate samples of unequal
> size. I would like to bin them and calculate the earth mover's
> distance between them.
> 
> I am not sure if this is what the example above does.
> Cheers
> 
> Lorenzo
> 
> 


Prof. Dr. Dominic Schuhmacher
Institute for Mathematical Stochastics
University of Goettingen
Goldschmidtstrasse 7
D-37077 Goettingen
Germany

Phone: +49 (0)551 39172107
E-mail: dominic.schuhmac...@mathematik.uni-goettingen.de

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.