Apologies. Jim Holtman has pointed out a couple of problems/queries with my original email that I would like to clarify.

Firstly, I introduced a typo when trying to be helpful. In my email
below, I had incorrectly typed out one of the species codes I would
count:

10000000
16220602
20110000
24000000
40320203 ## This should have been 40210102
45140000
45630600   == 7 "species" present.

Secondly, the criteria I laid out might suggest that in the 10 rows of
example I quoted, I would count both:

45630000
45630600

This is not what I wanted, and apologies that this was not clear. I only
want to count 45630600 because this is more "specific" in terms of what
creature this is than 45630000. I don't know that 45630000 is not
45630600, so I should not count both 45630000 and 45630600, as this
could be double counting.

These data are species counts and sometimes it is not possible to
identify an individual to species level. Sometimes we can't even get the
genus, or even the family, which is why we sometimes have a count for
the family (45630000) as well as for the genus (45630600) in the same
sample/site. How precise the identification is depends on how much of
the individual there is to identify it from.

So I want to count a higher-level category only if I have not counted a
lower-level category contained within it.

I hope this is a little bit clearer? And no, I did not come up with this
coding system nor the idea to use "counts" of "species" in this way...
;-)

Apologies if my original email caused unnecessary confusion.

All the best,

G

On Wed, 2009-02-18 at 13:37 +0000, Gavin Simpson wrote:
> Dear List,
>
> I have a data set stored in the following format:
>
> > head(dat, n = 10)
>       id  sppcode abundance
> 1  10307 10000000         1
> 2  10307 16220602         2
> 3  10307 20000000         5
> 4  10307 20110000         2
> 5  10307 24000000         1
> 6  10307 40210000        83
> 7  10307 40210102        45
> 8  10307 45140000         1
> 9  10307 45630000         1
> 10 10307 45630600        41
> > str(dat)
> 'data.frame':   111 obs. of  3 variables:
>  $ id       : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
>  $ sppcode  : chr  "10000000" "16220602" "20000000" "20110000" ...
>  $ abundance: num  1 2 5 2 1 83 45 1 1 41 ...
>
> that represent counts of species, recorded with a particular coding
> system. The abundance column is not needed for this particular
> operation, but is present in the data files.
>
> I am interested in counting entries (rows) in the sppcode component of
> dat. The sppcode takes a particular format: Order Family Genus Species,
> with 2 alphanumeric digits allocated for each level of the hierarchy. I
> want to know how many species there are in each site (the id factor),
> but I should only count a higher level entry if there are no lower
> levels present.
>
> For example, for the above data excerpt (just the headed rows), I would
> count the following rows:
>
> 10000000
> 16220602
> 20110000
> 24000000
> 40320203
> 45140000
> 45630600   == 7 "species" present.
>
> To be more specific, I don't count 45630000 (row 9) because there exists
> a sppcode for this 'id' where either of the next two pairs of digits are
> not all 0's.
>
> In words, I want to count all rows where WWXXYYZZ are ZZ != 00, then,
> rows where ZZ == 00 only if the WWXXYY combination has not been counted
> yet.
>
> An example data set has been placed in my University web space and can
> be read into R with the following:
>
> ## read example csv data
> dat <-
> read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
>          colClasses = c("factor","character","numeric"))
> ## show the data
> head(dat, n = 10)
>
> And the sppcode variable can be broken out into the 4 levels if required via:
>
> ## split out the four levels of categorisation:
> dat2 <- data.frame(dat,
>                    order = with(dat, substr(sppcode, 1, 2)),
>                    family = with(dat, substr(sppcode, 3, 4)),
>                    genus = with(dat, substr(sppcode, 5, 6)),
>                    species = with(dat, substr(sppcode, 7, 8)))
>
> The actual data set/problem contains several hundred different id's.
>
> I can't see an efficient way of processing these data in the manner
> described. Any help would be most gratefully received.
>
> Many thanks,
>
> Gavin
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
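
As an illustration only, the counting rule described above (count a code
only when no more specific code beneath it occurs at the same site) could
be sketched in base R along the following lines. This is a sketch under
stated assumptions, not a tested solution to the full problem: the helper
name most_specific() is made up for this example, it assumes that an
unidentified level is always coded "00" and that "00" pairs only occur as
a trailing run, and it relies on startsWith(), which needs R >= 3.3.0.
The ten example rows are typed in by hand from the head(dat) output.

```r
## Hypothetical helper (not from the original thread): keep only those
## codes that have no more specific code beneath them in the same sample.
## Assumes unidentified levels are coded "00" and only occur as a
## trailing run (Order Family Genus Species, 2 characters each).
most_specific <- function(codes) {
  codes <- unique(codes)
  ## drop trailing "00" pairs, e.g. "45630000" -> "4563"
  stems <- sub("(00)+$", "", codes)
  keep <- vapply(seq_along(codes), function(i) {
    ## drop a code if any *other* code starts with its stem
    !any(startsWith(codes[-i], stems[i]))
  }, logical(1))
  codes[keep]
}

## the ten example rows quoted in the email, entered by hand
dat <- data.frame(
  id = factor(rep("10307", 10)),
  sppcode = c("10000000", "16220602", "20000000", "20110000", "24000000",
              "40210000", "40210102", "45140000", "45630000", "45630600"),
  abundance = c(1, 2, 5, 2, 1, 83, 45, 1, 1, 41),
  stringsAsFactors = FALSE
)

## number of countable "species" per site (the id factor)
n_taxa <- tapply(dat$sppcode, dat$id,
                 function(x) length(most_specific(x)))
n_taxa
```

For the ten rows shown this gives 7 for site 10307, which agrees with the
corrected hand count. The comparison of every code against every other
code within a site is quadratic in the number of codes per site, which
should be acceptable for samples of this size but has not been checked
against the full data set.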