Dear Nate,

Although I learned from Phillippe's response about the existence of log1p, I don't think I will use it (for the reasons below). Thierry's response is true for Poisson data, but not for non-integer values. Still, it points in an important direction: all too often, zeros emanate from a different process than the other values (see mixed distributions, zero-inflated models, hurdle models and all that). In that case, you should consult Ben Bolker's excellent book (which is probably still available as a draft on his homepage, but is also worth buying).

If you want to transform, here is my take:

My folklore guidelines for the c in log(x+c) are:
1. c should be roughly 1/2 of the smallest non-zero value: signif(0.5*sort(unique(x))[2], 2)
2. c should be the square of the first quartile divided by the third quartile: (quantile(x)[2]^2)/quantile(x)[4]
For example:
set.seed(11011)
x <- c(runif(95), rep(0,5))

Method 1: c=0.0015
Method 2: c=0.015
While this looks like a huge difference (an order of magnitude), it actually isn't all that much, given the range of the data:

plot(density(x))
abline(v=c(0.0015, 0.015))
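
The two rules of thumb can be wrapped in small helper functions. This is only a sketch: the names c_half_min and c_quartile are made up for illustration, and min(x[x > 0]) is used in place of sort(unique(x))[2] so that the first rule also works when x happens to contain no zeros.

```r
# Hypothetical helpers (names are not from the original post).

# Method 1: half of the smallest strictly positive value,
# rounded to 2 significant digits.
c_half_min <- function(x) {
  signif(0.5 * min(x[x > 0]), 2)
}

# Method 2: square of the first quartile divided by the third quartile.
c_quartile <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  q[1]^2 / q[2]
}

set.seed(11011)
x <- c(runif(95), rep(0, 5))

c1 <- c_half_min(x)
c2 <- c_quartile(x)
print(c(method1 = c1, method2 = c2))
```

Both helpers assume that x is non-negative and contains at least some positive values.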

I do have a reference for method 2, but it is in German (Stahel, W. A. (2002) Statistische Datenanalyse. Eine Einführung für Naturwissenschaftler. Vieweg, Braunschweig.). Method 1 is what my PhD statistics adviser recommended. Since he was right about everything else, I rely on his advice here, too. That may, I acknowledge, not be good enough for you. But maybe someone else can find a proper reference.

The key thing for any value of c is that it doesn't distort the analysis. But then, how do you detect distortion? I compared an analysis of the rank-transformed data with the same analysis using log(x+c) for various values of c. When c was large (in the current example, e.g., 0.5 or so), the results started to differ from the rank-based analysis. To use log1p (i.e., c = 1) here would be a dramatic distortion!
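
One way such a comparison could be set up is sketched below. The predictor z, the noise level and the choice of a simple linear model are assumptions made purely for illustration; they are not the analysis I actually ran. The idea is to fit the same model to rank(x) and to log(x + c) for several c, and to see where the test statistic starts to drift away from the rank-based one.

```r
set.seed(11011)
x <- c(runif(95), rep(0, 5))

# Hypothetical predictor, invented for this illustration only.
z <- x + rnorm(length(x), sd = 0.3)

# t value of the slope in a simple linear regression.
t_value <- function(response, predictor) {
  summary(lm(response ~ predictor))$coefficients["predictor", "t value"]
}

# c = 1 corresponds to log1p(x).
c_values <- c(0.0015, 0.015, 0.5, 1)

comparison <- data.frame(
  c     = c_values,
  t_log = sapply(c_values, function(k) t_value(log(x + k), z))
)
# Reference: the same model fitted to the rank-transformed response.
comparison$t_rank <- t_value(rank(x), z)

print(comparison)
```

What counts as "starting to differ" is a judgment call and depends on the data and on the analysis at hand.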

Another way to look at it is through the Box-Cox transformation. Since Box-Cox transforms towards a symmetric (not necessarily normal) distribution, c should also be chosen in such a way as to facilitate the transformation towards symmetry.
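
A rough sketch of the symmetry idea follows. It is not Box-Cox itself: it simply scans a grid of candidate values (the grid limits are an arbitrary assumption) and picks the c for which log(x + c) has the smallest absolute sample skewness. The helper skewness is defined here and does not come from a package.

```r
# Sample skewness (moment estimator); hypothetical helper.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(11011)
x <- c(runif(95), rep(0, 5))

# Candidate constants on a log scale; the range is an arbitrary choice.
c_grid <- 10^seq(-4, 0, length.out = 41)

# Skewness of the transformed data for each candidate c.
skew <- sapply(c_grid, function(k) skewness(log(x + k)))

# The candidate whose transform is closest to symmetric.
c_sym <- c_grid[which.min(abs(skew))]
print(c_sym)
```

The result is only as fine as the grid, and skewness is just one of several possible measures of asymmetry.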

HTH,

Carsten


Nate Upham wrote:
I have a general stats question for you guys:

How does one normally deal with zero (0) values when log transforming data?
I would like to log transform (natural log, ln) several response variables for use in quantile regression. But one of my variables includes several zero values. Since ln(0) = -infinity, this is not readily possible. Is it best to remove all data with zero values? Or should I add a very small number to each value (e.g., 0.00001)? This seems problematic. Is there an easy way to address this issue?

Thanks much for your help,
--Nate

_________________________________
Nathan S. Upham
Ph.D. student
Committee on Evolutionary Biology
University of Chicago
1025 E. 57th St., Culver 402
Chicago, IL 60637
nsup...@uchicago.edu

_______________________________________________
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology


--
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ
Permoserstr. 15
04318 Leipzig
Germany

Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
Email: carsten.dorm...@ufz.de
internet: http://www.ufz.de/index.php?de=4205
