[R] producing histogram-like plot
Hi!

I have a dataset that looks like this:

    0.0 14 0.0 3 0.9 12 0.7315 0.782 1.0 15 0.3 2 0.328

...and so on, i.e. a value between 0 and 1, and a number.

I would like to plot this in a histogram-like manner. I would like to have a set of bins, each 0.1 wide, and plot the sum of the values in column 2 that fall within each bin. I.e., in this case I would like the first bin, 0.0, to have the value 17, the second, 0.1, to have the value 0 and so on, until the last bin, which has the value 15.

I am sadly uncertain both of how to sum these together and of which plot type to use.

Thanks in advance!

Karin
--
Karin Lagesen, Ph.D.
Centre for Ecological and Evolutionary Synthesis (CEES)
University of Oslo, Dept. of Biology
P.O. Box 1066 Blindern
0316 Oslo, Norway
Ph. +47 22844132
Fax. +47 22854001
Email karin.lage...@bio.uio.no
http://folk.uio.no/karinlag
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
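One possible way to do the binning described above, as a minimal base R sketch. The data frame `dat` and its column names are made up for illustration (the rows are a few of the readable pairs from the post). The bin index is computed by rounding rather than with `cut()` on `seq(0, 1, by = 0.1)`, since floating-point bin edges can push a value such as 0.3 into the wrong bin.

```r
# Hypothetical data in the shape described: a value in [0, 1] and a number.
dat <- data.frame(value = c(0.0, 0.0, 0.9, 1.0, 0.3),
                  count = c(14, 3, 12, 15, 2))

# Bin index 0..10; rounding first guards against floating-point noise.
idx <- floor(round(dat$value * 10, 6))

# Keep all eleven bins as factor levels so that empty bins show up as 0.
dat$bin <- factor(idx, levels = 0:10, labels = sprintf("%.1f", (0:10) / 10))

# Sum column 2 within each bin; xtabs reports 0 for empty levels.
sums <- xtabs(count ~ bin, data = dat)
print(sums)

# A bar plot of the per-bin sums gives the histogram-like picture.
barplot(sums, xlab = "bin (lower edge)", ylab = "summed count")
```

`barplot()` is used here because the heights are precomputed sums; `hist()` only counts observations.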
[R] plotting histograms/density plots in a triangular layout?
Hi!

I have a set of 49 pairwise comparisons that I have done. From these I would like to plot either histograms or density plots of the values I get. Now, I can plot one histogram per comparison, but I have problems getting the output I want. When plotting like I normally would do:

    histogram(~percid | orgA_orgB, data = alldata)

I get the histograms next to each other in a box-like shape. However, since these are pairwise (7x7), I would like to have them placed in a triangular shape, like this:

    1 x
    2 x x
    3 x x x
      1 3 3

where the x's represent where I want plots, and the 1, 2, 3 represent the legends.

I have seen similar plots done by R, so I know it is possible, but the question is how :D

TIA,

Karin
--
Karin Lagesen, Ph.D.
Centre for Ecological and Evolutionary Synthesis (CEES)
University of Oslo, Dept. of Biology
P.O. Box 1066 Blindern
0316 Oslo, Norway
Ph. +47 22844132
Fax. +47 22854001
Email karin.lage...@bio.uio.no
http://folk.uio.no/karinlag
[R] transparent concentric circles
I have a data set which I would like to plot as a set of concentric circles. The data represent a count of the number of characteristics shared by various elements. An example would look like this:

    1 100
    2 75
    3 50
    4 25

I.e. all four sets share 25 characteristics, three of them share 50 characteristics, and so on.

I would like to plot these as concentric circles, with the circle size preferably being proportional to the number of elements (this is not a must, however). I would also like the colors of the circles to become stronger/deeper as we progress to the innermost circle (which would be the one containing the number of characteristics shared by all four).

Can somebody point me to what I can use to do this?

Thanks!

Karin
--
Karin Lagesen, PhD
Re: [R] performing function on data frame
David Hajage writes:

> Hi Karin,
>
> I'm not sure I understand... Is this what you want ?
>
> (d$y - mean(d$y))/sd(d$y)

Yes, and also a bit no. Each column in my data frame represents one data set. For every element in this data set I want to know the z value for that element. I.e., I want to create a new data frame from the old data frame, where each element in the new data frame is

    newDF[i,j] = (oldDF[i,j] - mean(oldDF[,j])) / sd(oldDF[,j])

I could, I think, iterate like this over the data frame, but I keep thinking that one of the apply functions should be employed...

Karin
--
Karin Lagesen, Ph.D.
karin.lage...@medisin.uio.no
http://folk.uio.no/karinlag
[R] performing function on data frame
Hi!

First, pardon me if this is a FAQ. I think I should be using some sort of apply, but I am not managing to figure those out.

I have a data frame similar to this:

    > d <- data.frame(x = LETTERS[1:5], y = rnorm(5), z = rnorm(5))
    > d
      x          y          z
    1 A  0.1605464 -0.2719820
    2 B -0.9258660  1.2623117
    3 C -0.3602656  1.5470351
    4 D  1.2621797  1.2996500
    5 E  0.6021728  0.5027095
    >

From this I want to get a new data frame which contains the z scores based on the values found in each column. For instance, for element [C,y], I would like to calculate (-0.3602656 - mean(column y)) / stddev(column y).

Thanks!

--
Karin Lagesen, Ph.D.
karin.lage...@medisin.uio.no
http://folk.uio.no/karinlag
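For the column-wise z scores asked about here, a minimal sketch with `lapply()` over the numeric columns might look like the following. The seed and the name `zs` are arbitrary choices for the illustration; `scale()` from base R computes the same thing and is used as a cross-check.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
d <- data.frame(x = LETTERS[1:5], y = rnorm(5), z = rnorm(5))

# Pick out the numeric columns; the label column x is left untouched.
num <- sapply(d, is.numeric)

# Replace each numeric column by (value - column mean) / column sd.
zs <- d
zs[num] <- lapply(d[num], function(col) (col - mean(col)) / sd(col))
print(zs)

# Cross-check against scale(), which centres and scales each column.
print(all.equal(zs$y, as.vector(scale(d$y))))
```

Note the parentheses around the subtraction: without them the division binds first and only the mean gets divided by the standard deviation.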
[R] xyplot key issue - line colors
I have a problem regarding the colors assigned to the lines in the key to an xyplot. I specify the plot like this:

    xyplot(numbers~sqrt(breaks)|moltype+disttype, groups = type,
           data = alldata, layout = c(3,2), type = "l", lwd = 2,
           col = c("gray", "skyblue"),
           key = simpleKey(levels(alldata$type), points = FALSE,
                           lines = TRUE, columns = 2, lwd = 2,
                           col = c("gray", "skyblue")))

However, the lines in the key (the lines that indicate which line is which) are still blue and magenta, and not gray and skyblue.

I have seen something about superposing lines on top of this somehow, but I couldn't figure out how to do it.

Thanks!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
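One approach that usually keeps the key and the plotted lines in sync is to set the colors once through `par.settings` (the `superpose.line` component) and let `auto.key` build the key from those settings. The sketch below uses a made-up stand-in for `alldata`, since the real data are not shown in full; only the factor names and levels are taken from the post.

```r
library(lattice)

# Hypothetical stand-in for alldata with the columns the call needs.
alldata <- expand.grid(breaks = 0:10,
                       type = c("Between species", "Same species"),
                       moltype = c("16S", "23S", "5S"),
                       disttype = c("Gapped Distances", "Ungapped Distances"))
alldata$numbers <- seq_len(nrow(alldata))  # arbitrary values

# Colors and line width are set in one place; the key reads the same settings.
p <- xyplot(numbers ~ sqrt(breaks) | moltype + disttype, groups = type,
            data = alldata, layout = c(3, 2), type = "l",
            par.settings = list(superpose.line =
                                  list(col = c("gray", "skyblue"), lwd = 2)),
            auto.key = list(points = FALSE, lines = TRUE, columns = 2))
print(p)
```

The design point is that `simpleKey()` takes its colors from the current trellis settings rather than from a `col` argument in the plot call, so changing the settings is the more direct route.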
[R] conversion of data for use within barchart
I have a data matrix like this:

    > data[1:10,]
       aaname grp   cluster count
    1     Ala All Singleton   432
    2     Arg All Singleton  1239
    3     Asn All Singleton   396
    4     Asp All Singleton   152
    5     Cys All Singleton   206
    6     Gln All Singleton   370
    7     Glu All Singleton   211
    8     Gly All Singleton   594
    9     His All Singleton   213
    10    Ile All Singleton    44

where the cluster column has three levels:

    > levels(data$cluster)
    [1] "Array"     "Singleton" "rRNA"
    >

Now, I would like to plot this like this:

    barchart(aaname~count|grp, group = cluster, data = data, stack = TRUE)

I am thus using the cluster as the grouping. I would like to plot the relative abundance within each grouping, such that the maximum level in my plot is always one (or 100). This would for instance mean, for Ala in the All grp, that the Singleton cluster constitutes, let's say, 60% of the Ala in the All grp, whereas Array and rRNA make up 20% each. In this case I would get in my plot a Singleton bar stretching to 60%, whereas the other two would be 20% each, all in all making 100%.

I am uncertain whether I am managing to describe what I want, so I hope somebody understands!

Thanks!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
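A sketch of how the relative abundances could be computed before plotting, using `ave()` to divide each count by the total for its amino acid and group. The small data frame `dat` and its numbers are invented for the example; only the column names follow the post.

```r
# Hypothetical counts for two amino acids in one group.
dat <- data.frame(aaname  = rep(c("Ala", "Arg"), each = 3),
                  grp     = "All",
                  cluster = rep(c("Array", "Singleton", "rRNA"), times = 2),
                  count   = c(20, 60, 20, 10, 30, 60))

# Total count per (aaname, grp) combination, repeated on every row.
totals <- ave(dat$count, dat$aaname, dat$grp, FUN = sum)

# Fraction of each cluster within its amino acid and group.
dat$frac <- dat$count / totals
print(dat)
```

With `frac` in place of `count`, a stacked call such as `barchart(aaname ~ frac | grp, groups = cluster, data = dat, stack = TRUE)` should give bars that all end at 1.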
[R] xyplot and separate abline per plot
Hello list!

I have a set of data like this:

    > alldata[1:5,]
      breaks numbers         disttype moltype            type
    1  0.000    6598 Gapped Distances      5S Between species
    2  0.407       0 Gapped Distances      5S Between species
    3  0.813    5228 Gapped Distances      5S Between species
    4  1.220       0 Gapped Distances      5S Between species
    5  1.627    9702 Gapped Distances      5S Between species
    > levels(alldata$disttype)
    [1] "Gapped Distances"   "Ungapped Distances"
    > levels(alldata$type)
    [1] "Between species" "Same species"
    > levels(alldata$moltype)
    [1] "16S" "23S" "5S"
    >

Which I plot like this:

    xyplot(numbers~sqrt(breaks)|moltype+disttype, groups = type, data = alldata)

This results in a plot consisting of six different panels. Now, I have a set of six different values that I would like to incorporate into these plots through a vertical line in each panel (one separate value per panel). I think I can do this through panel.abline somehow, but I don't know how to incorporate that into the xyplot command, and I don't know how to specify which value I want plotted in each panel.

I hope I am able to convey what I want :)

Thanks in advance,

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
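One way this is commonly done is with a custom panel function that draws the usual content and then calls `panel.abline()` with a value looked up by `packet.number()`. The sketch below assumes a made-up `alldata` and a made-up vector `vlines` holding one x position per panel, in packet order (the first conditioning factor, moltype, varies fastest).

```r
library(lattice)

# Hypothetical stand-in for alldata.
alldata <- expand.grid(breaks = 0:10,
                       type = c("Between species", "Same species"),
                       moltype = c("16S", "23S", "5S"),
                       disttype = c("Gapped Distances", "Ungapped Distances"))
alldata$numbers <- seq_len(nrow(alldata))  # arbitrary values

# Hypothetical line positions, one per panel, on the sqrt(breaks) scale.
vlines <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)

p <- xyplot(numbers ~ sqrt(breaks) | moltype + disttype, groups = type,
            data = alldata,
            panel = function(...) {
              panel.xyplot(...)                             # the usual grouped plot
              panel.abline(v = vlines[packet.number()], lty = 2)  # this panel's line
            })
print(p)
```

Because the x variable is `sqrt(breaks)`, the values in `vlines` have to be given on that transformed scale.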
Re: [R] xyplot questions - axis and plotting two things in same panel
"Deepayan Sarkar" <[EMAIL PROTECTED]> writes:

> On 6/25/08, Franz Mueter <[EMAIL PROTECTED]> wrote:
>> As for your first problem, try:
>>
>> xyplot(numbers~breaks|moltype, groups = type, data = alldata, type = "l")
>>
>>
>> -Original Message-
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
>> Behalf Of Karin Lagesen
>> Sent: Wednesday, June 25, 2008 2:13 AM
>> To: r-help@r-project.org
>> Subject: [R] xyplot questions - axis and plotting two things in same panel
>
> [...]
>
>> I am also wondering about whether it is possible to change the x axis
>> scale here. I have data going from 0 to 35, but most of the
>> interesting stuff is between 0 and 5. Thus I am wondering if there is
>> any way of specifying that the 0 to 5 range should take up 30 % of the
>> x axis (or something like that) and gradually shrink the axis after
>> that. I have tried doing log on the x axis, but I have a lot of zeros
>> in my data set that really breaks everything.
>
> Rather than having the software support arbitrary axis
> transformations, it would be simpler to transform the data; e.g.,
>
> xyplot(numbers~asinh(breaks) | moltype, groups = type, data = alldata,
> type = "l")
>
> That still leaves the problem of "nice" axis labels in the original
> scale. That is in general a hard problem. For special cases, you can
> specify explicit tick positions (in the transformed scale) and
> associated labels (in the original scale) using the 'scales' argument.

Since I search the archives of this list quite a lot, I thought I'd just post the results I came up with and which worked nicely for me.

I ended up using the sqrt transformation, which gave me a very nice plot.
For the axis I used the scales argument, as follows:

    # the max value on the x axis for my data is 40
    labels = seq(0,40, by = 5)
    atvalues = sqrt(seq(0,40, by = 5))
    # the plot itself; the tick positions are given for the x axis only
    xyplot(numbers~sqrt(breaks)|moltype, groups = type, data = alldata,
           type = "l", scales = list(x = list(at = atvalues, labels = labels)))

This worked like a charm!

Thanks a lot for your help!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
[R] xyplot questions - axis and plotting two things in same panel
Hi list!

I am trying to use xyplot to plot some graphs. The data I have looks like this:

    > alldata[1:10,]
       breaks numbers moltype            type
    1   0.000    6598      5S Between species
    2   0.407       0      5S Between species
    3   0.813    5228      5S Between species
    4   1.220       0      5S Between species
    5   1.627    9702      5S Between species
    6   2.033       0      5S Between species
    7   2.440    7834      5S Between species
    8   2.847       0      5S Between species
    9   3.253   12084      5S Between species
    10  3.660      24      5S Between species
    >

where moltype and type are factors, moltype having three different levels, and type having two.

I am now plotting things like this:

    xyplot(numbers~breaks|moltype+type, data = alldata, type = "l")

which gives me six panels showing just what I want.

Now, my first problem is how to plot two graphs in the same panel. I would like one panel per moltype, but I want the type factor to result in two graphs plotted in the same panel. How do I specify this?

I am also wondering whether it is possible to change the x axis scale here. I have data going from 0 to 35, but most of the interesting stuff is between 0 and 5. Thus I am wondering if there is any way of specifying that the 0 to 5 range should take up 30 % of the x axis (or something like that) and gradually shrink the axis after that. I have tried doing log on the x axis, but I have a lot of zeros in my data set, which really breaks everything.

Thanks for your help!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
[R] using the stepfun to plot histogram outline.
Hello list :)

I have lots of values which I would like to get a histogram outline out of. An example of what I am talking about:

    testdata = runif(100)
    bbb = seq(0,1, by = 0.01)
    hist(testdata, breaks = bbb)

I would like to get the outline of the resulting histogram.

Now, I think that I can do this using the stepfun function. However, I am uncertain of how to get the data the stepfun function requires.

From ?stepfun

    Arguments:

       x: numeric vector giving the knots or jump locations of the step
          function for 'stepfun()'.  For the other functions, 'x' is as
          'object' below.

       y: numeric vector one longer than 'x', giving the heights of the
          function values _between_ the x values.

x, I think, is the same as bbb above. I am however uncertain of how I would go about getting the data needed for y, given that the data I have is in the same format as testdata above.

Thanks for your help!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
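The heights that `stepfun()` wants can be taken from `hist(..., plot = FALSE)`, which returns the bin counts without drawing anything. A minimal sketch, assuming the same `testdata` and `bbb` as in the post (the seed is an arbitrary addition for repeatability): padding the counts with a zero on each side gives a vector one longer than the breaks, as `?stepfun` requires.

```r
set.seed(42)  # arbitrary seed for a repeatable example
testdata <- runif(100)
bbb <- seq(0, 1, by = 0.01)

# Compute the histogram without plotting it; h$counts has one entry per bin.
h <- hist(testdata, breaks = bbb, plot = FALSE)

# y must be one longer than x: zero before the first break,
# the bin counts in between, zero after the last break.
outline <- stepfun(h$breaks, c(0, h$counts, 0))

# Draw only the outline, without the knot markers.
plot(outline, do.points = FALSE, main = "Histogram outline",
     xlab = "testdata", ylab = "count")
```

A shorter alternative with the same data would be `plot(h$breaks, c(h$counts, 0), type = "s")`.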
[R] vector comparison
I know this is fairly basic, but I must have somehow missed it in the manuals. I have two vectors, often of unequal length. I would like to compare them for identity. The order of the elements does not matter, but they should contain the same elements.

I.e., I want this kind of comparison:

    > if (1==1) show("yes") else show("blah")
    [1] "yes"
    > if (1==2) show("yes") else show("blah")
    [1] "blah"
    >

Only replace the numbers with, for instance, the vectors

    > a = c("a")
    > b = c("b","c")
    > c = c("c","b")

Now, I realize I only get a warning when comparing things, but this to me means that I am not doing it correctly:

    > if (a==a) show("yes") else show("blah")
    [1] "yes"
    > if (a==b) show("yes") else show("blah")
    [1] "blah"
    Warning message:
    In if (a == b) show("yes") else show("blah") :
      the condition has length > 1 and only the first element will be used
    >
    > if (b == c) show("yes") else show("blah")
    [1] "blah"
    Warning message:
    In if (b == c) show("yes") else show("blah") :
      the condition has length > 1 and only the first element will be used
    >

I have also tried the %in% comparator, but that one throws warnings too:

    > if (b %in% c) show("yes") else show("blah")
    [1] "yes"
    Warning message:
    In if (b %in% c) show("yes") else show("blah") :
      the condition has length > 1 and only the first element will be used
    >
    > if (c %in% c) show("yes") else show("blah")
    [1] "yes"
    Warning message:
    In if (c %in% c) show("yes") else show("blah") :
      the condition has length > 1 and only the first element will be used
    >

So, how is this really supposed to be done?

Thanks!

Karin
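For an order-insensitive comparison that yields a single TRUE or FALSE, base R has `setequal()`. A small sketch with the vectors from the post; the third vector is called `cc` here (a renaming made for the example) to avoid masking the function `c()`.

```r
a  <- c("a")
b  <- c("b", "c")
cc <- c("c", "b")

# setequal() returns one logical value, so it is safe inside if().
if (setequal(b, cc)) print("yes") else print("blah")   # same elements
if (setequal(a, b))  print("yes") else print("blah")   # different elements

# setequal() ignores duplicates; if multiplicity matters,
# comparing the sorted vectors is one alternative.
same_with_counts <- identical(sort(b), sort(cc))
print(same_with_counts)
```

The warnings in the post arise because `==` and `%in%` return one logical per element, whereas `if()` needs exactly one.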
[R] figuring out the results from hclust
I have two examples that I run hclust on:

    a = c(0,1,1.5,1.5)
    b = c(1,0,1.5,1.5)
    c = c(1.5,1.5,0,0.5)
    d = c(1.5,1.5,0.5,0)
    ll = as.matrix(rbind(a,b,c,d))
    test = as.dist(ll)
    long = hclust(test)

    a = c(0,0.3,1,1)
    b = c(0.3,0,1,1)
    c = c(1,1,0,0.5)
    d = c(1,1,0.5,0)
    ll = as.matrix(rbind(a,b,c,d))
    test = as.dist(ll)
    short = hclust(test)

The main difference between them is whether a and b get clustered higher up or lower down than the c,d cluster.

I am working on partitioning this kind of data into three clusters. I know I can do that with cutree. The result I get from that is the following:

    > cutree(short, k=3)
    a b c d
    1 1 2 3
    > cutree(long, k=3)
    a b c d
    1 2 3 3
    >

And I can also access the height vector for both:

    > short$height
    [1] 0.3 0.5 1.0
    > long$height
    [1] 0.5 1.0 1.5
    >

So I know at what heights they get merged. What I seem to be unable to get at is which of the clusters shown by cutree corresponds to which split. When I examine short in a plot, I can easily see that the highest split (i.e. the one corresponding to the last height, 1, in the height vector) is between the cutree clusters 1 and 2,3. In the long example this split is between 1,2 and 3. I would however like to not examine all of the data I have by hand :)

Could any of you point me to what I need to do to get at this data? I have tried to examine the merge data in both cases, but I am coming up short.

Thanks!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
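One way to see which of the three cutree clusters sit on which side of the top split is to cross-tabulate the k = 3 cut against the k = 2 cut. The sketch below rebuilds the two examples from the post; the helper name `top_split` is invented for the illustration.

```r
# The two distance matrices from the post, with row names as labels.
long_m  <- rbind(a = c(0, 1, 1.5, 1.5), b = c(1, 0, 1.5, 1.5),
                 c = c(1.5, 1.5, 0, 0.5), d = c(1.5, 1.5, 0.5, 0))
short_m <- rbind(a = c(0, 0.3, 1, 1), b = c(0.3, 0, 1, 1),
                 c = c(1, 1, 0, 0.5), d = c(1, 1, 0.5, 0))
long  <- hclust(as.dist(long_m))
short <- hclust(as.dist(short_m))

# Rows: clusters at k = 3. Columns: the two sides of the highest split.
# A k = 3 cluster belongs to the side where its row has a non-zero count.
top_split <- function(hc) {
  table(k3 = cutree(hc, k = 3), k2 = cutree(hc, k = 2))
}

print(top_split(short))  # cluster 1 on one side, clusters 2 and 3 on the other
print(top_split(long))   # clusters 1 and 2 on one side, cluster 3 on the other
```

The same idea extends to other levels: comparing `cutree(hc, k)` with `cutree(hc, k - 1)` shows which two clusters the merge at that level joins.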
[R] deconvoluting hclust objects
I have a hclust object that looks like this:

    > test77

    Call:
    hclust(d = input)

    Cluster method   : complete
    Number of objects: 11

    > test77$height
     [1] 0.000 0.000 0.000 0.000 0.000 0.000 0.900
     [8] 0.9473684 1.7894737 8.5771948
    > test77$merge
          [,1] [,2]
     [1,]   -1   -2
     [2,]   -3    1
     [3,]   -7    2
     [4,]   -8    3
     [5,]   -4   -6
     [6,]   -5  -10
     [7,]   -9    5
     [8,]  -11    4
     [9,]    7    8
    [10,]    6    9
    >

I am specifically interested in what happens when you divide this object into three clusters. When I look at the plot, the three clusters form like this:

    [dendrogram sketch garbled in the archive; its three leaves are labelled 2, 3 and 6]

What I want is to get distance information on the three groups, and also the number of objects in each. The distances I need are in height, but I don't know how to get at the sizes of the subtrees. I know I can use cutree to get at the cut, but I cannot see anything systematic in which group becomes group no. 1 and so forth.

The results I want would in this example be:

    2 3 6
    8.5771948 1.7894737

Any hints for me?

Thanks!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
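A sketch of how the group sizes and the two top merge heights could be pulled out: `table()` on the `cutree()` result gives the sizes, and the last k - 1 entries of `height` are the merges that the cut undoes. The helper `summarise_cut` and the toy data are made up, since the original `input` is not available.

```r
# Hypothetical helper: sizes of the k groups and the k - 1 top merge heights.
summarise_cut <- function(hc, k = 3) {
  sizes   <- as.vector(table(cutree(hc, k = k)))   # objects per group
  heights <- rev(hc$height)[seq_len(k - 1)]        # highest merges first
  list(sizes = sizes, heights = heights)
}

# Toy one-dimensional data with three well separated groups.
x  <- c(0, 1, 10, 12, 15, 40)
hc <- hclust(dist(x))   # complete linkage, as in the post

res <- summarise_cut(hc, k = 3)
print(res$sizes)
print(res$heights)
```

This relies on the merge heights being in increasing order, which holds for complete linkage.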
[R] problems with data frames, factors and lists
I have a function that creates a list based on some clustered data:

    mix <- function(Y, pid) {
      hc = gethc(Y,pid)
      maxheight = max(hc$height)
      noingrp = processhc(hc)
      one = noingrp$one
      two = noingrp$two
      twoisone = "one"
      if (two != 1) twoisone = "more"
      out = list(pid = pid,one = noingrp$one, two = noingrp$two,
                 diff = maxheight, noseqs = length(hc$labels),
                 twogrp = twoisone)
      return(out)
    }

Example result:

    > mix(tsus_same, 77)
    $pid
    [1] 77

    $one
    [1] 9

    $two
    [1] 2

    $diff
    [1] 8.577195

    $noseqs
    [1] 11

    $twogrp
    [1] "more"

    >

I then use this function in another function that just runs this function through a lot of data:

    doset <- function(sameset) {
      pids = unique(c(sameset$APID, sameset$BPID))
      for (f in pids) {
        oputframe = data.frame(rbind(oputframe, mix(sameset, f)))
      }
      return(oputframe)
    }

All values except $twogrp are numbers. There are two possible values for $twogrp, "one" and "more". The first one is more common and gets added to the data frame first. The result is that I cannot add the rows where this is "more" without getting

    38: In `[<-.factor`(`*tmp*`, ri, value = "more") :
      invalid factor level, NAs generated

Now, this is a pain in the neck. How can I merge these lists into the data frame and still have the value $twogrp as a factor?

Thanks, and I hope my code makes some sense!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
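One pattern that tends to avoid the "invalid factor level" problem is to collect the rows with the text column kept as character, bind them once with `do.call(rbind, ...)`, and only then convert to a factor with both levels declared. The sketch uses a made-up stand-in `mix_stub()` because `gethc()` and `processhc()` are not shown in the post.

```r
# Hypothetical stand-in for mix(): returns one record as a list.
mix_stub <- function(pid) {
  list(pid = pid, one = pid %% 5, two = pid %% 3,
       twogrp = if (pid %% 3 == 1) "one" else "more")
}

doset_sketch <- function(pids) {
  # One single-row data frame per pid, strings kept as character.
  rows <- lapply(pids, function(f) {
    as.data.frame(mix_stub(f), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  # Convert once, with every possible level declared up front.
  out$twogrp <- factor(out$twogrp, levels = c("one", "more"))
  out
}

res <- doset_sketch(c(77, 12, 4))
print(res)
```

Binding once at the end also avoids growing the data frame inside the loop.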
[R] cloud plot has white(transparent?) background
I am using the code example from the R graph gallery to look at a cloud plot:

    require(lattice)
    data(iris)
    print(cloud(Sepal.Length ~ Petal.Length * Petal.Width,
                data = iris, groups = Species,
                screen = list(z = 20, x = -70),
                perspective = FALSE,
                key = list(title = "Iris Data", x = .15, y=.85,
                           corner = c(0,1), border = TRUE,
                           points = Rows(trellis.par.get("superpose.symbol"), 1:3),
                           text = list(levels(iris$Species)))))

Now, in the example on the webpage this comes out with a nice gray background that makes things easier to see. Mine comes out with a white, potentially transparent background, and the point colors have changed as well.

How do I get the nice gray color back?

Thanks!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
[R] adding bwplot to existing bwplot
Hello.

I have made many normal boxplots where I have added a new boxplot to an existing one. When I have done this, I have used the at argument to move the boxplots a bit so that they fit next to each other, like this:

    boxplot(data.., at = number_of_categories-0.15)
    boxplot(data.., at = number_of_categories+0.15, add = TRUE)

Now I am wondering if it is possible to do the same in some way with bwplot. The data I want to plot is like this:

    > operonthings[1:5,]
              phylum   pid type no_clust no_seqs
    1  Acidobacteria 15771   5S        1       1
    2  Acidobacteria 12638   5S        1       2
    3 Actinobacteria 16321   5S        2       6
    4 Actinobacteria    92   5S        2       2
    5 Actinobacteria    87   5S        1       5
    >

where phylum and type are the factors I would like to plot no_clust and no_seqs against. I basically want these in the same plot:

    bwplot(no_clust~type|phylum, data = operonthings)

and

    bwplot(no_seqs~type|phylum, data = operonthings)

Any thoughts on how to do this?

Thanks!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
[R] hclust graphics - plotting many points
Hello.

I have a distance matrix with lots of distances that I use hclust to organise. I then plot the results using the plot method of hclust. However, the plot itself takes around 20 minutes to make, since there are ~700 things in the matrix that I have distances for. I thus would like to dump this to some graphics format which will let me examine it further. I tried dumping it to postscript:

    postscript("myfile.ps", height = 50, pointsize=5)
    plot(my_hc_object)
    dev.off()

What happens is that, since most of the items in the matrix have a distance of zero to something, everything just becomes a black smear at the bottom where I cannot distinguish anything from anything else. I thus tried increasing the height and/or width and also downscaling the pointsize. None of these improved anything much.

So, now I am wondering if any of you have any tips for how I can get something like what I get in the x11() window, which I can also store and potentially show other people.

Thanks!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
[R] combine vector and data frame on field?
I have managed to create a data frame like this:

    > tsus_same_mean[1:10,]
         PID            Grp         Dist  PercAln     PercId
    1  12638  Acidobacteria          0.0    1.000      1.000
    2     87 Actinobacteria          0.0    0.970      0.970
    3     92 Actinobacteria  0.008902000    1.000      0.991
    4     94 Actinobacteria          0.0    1.000      1.000
    5    189 Actinobacteria  0.005876733    0.973  0.9676667
    6    242 Actinobacteria  0.001734200    0.973  0.9715333
    7    305 Actinobacteria          0.0    0.970      0.970
    8    307 Actinobacteria          0.0    0.970      0.970
    9    328 Actinobacteria          0.0    1.000      1.000
    10 10689 Actinobacteria          0.0    1.000      1.000
    >

and what I think is a factor like this:

    > tsuPIDCount[1:10]
     3  4  5  8  9 12 13 15 18 19
     2  2  2  3  4  7  4  2  2  3
    >

Now, I'd like to combine the two. The factor levels in tsuPIDCount correspond to the field called PID in the data frame.

Any hints on how to do this? cbind just adds the vector onto the end, and I couldn't quite figure out if I could somehow say that the level should correspond to the PID.

Thanks a lot for your help in advance :)

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
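If `tsuPIDCount` is the result of `table()` (a guess based on how it prints), one route is to turn it into a two-column data frame keyed by PID and then use `merge()`. The sketch below uses small made-up stand-ins for both objects.

```r
# Hypothetical stand-in for the data frame, one row per PID.
df <- data.frame(PID = c(12638, 87, 92, 94),
                 Grp = c("Acidobacteria", rep("Actinobacteria", 3)))

# Hypothetical stand-in for the counts: a table whose names are PIDs.
cnt <- table(c(87, 87, 92, 94, 94, 94))

# The names of the table carry the PIDs; the values are the counts.
counts <- data.frame(PID = as.numeric(names(cnt)), Count = as.vector(cnt))

# all.x = TRUE keeps PIDs that have no count; they get NA.
merged <- merge(df, counts, by = "PID", all.x = TRUE)
print(merged)
```

Unlike `cbind()`, `merge()` matches rows on the key column, so the order of the two inputs does not matter.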
[R] clustering problem
First I just want to say thanks for all the help I've had from the list so far :)

I now have what I think is a clustering problem. I have lots of objects between which I have measured a dissimilarity. Now, this list only has one entry per pair, so it is not symmetrical.

Example input:

    NameA    NameB    Dist
    189_1C2  189_1C1  0
    189_1C3  189_1C1  0.017
    189_1C3  189_1C2  0.017
    189_1C4  189_1C1  0
    189_1C4  189_1C2  0
    189_1C4  189_1C3  0.017
    189_1C5  189_1C1  0.05
    189_1C5  189_1C2  0.05
    189_1C5  189_1C3  0.067
    189_1C5  189_1C4  0.05
    189_1C6  189_1C1  0.05
    189_1C6  189_1C2  0.05
    189_1C6  189_1C3  0.067
    189_1C6  189_1C4  0.05
    189_1C6  189_1C5  0

The distance measure is 0 if identical, and then increases with increasing dissimilarity up to 1.

What I would like to get from these data is a hierarchical clustering graph. In this example I would then group 189_1C2, 189_1C1 and 189_1C4 together, 189_1C6 and 189_1C5 together, and 189_1C3 off by itself. The distances between the groups should be the mean distances between the objects within each group (I think).

I have looked at hclust and it seems like it should be able to do what I want. However, I am unsure of how to use it to get what I am looking for.

Thank you in advance for your help!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
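A sketch of one way to get from a one-entry-per-pair list to `hclust()`: fill a square matrix in both directions, convert it with `as.dist()`, and cluster with average linkage (which uses mean between-group distances). Only the first six rows of the example input are used here to keep it short; the cut height of 0.01 is an arbitrary choice for the illustration.

```r
# First six pairs from the example input.
pairs <- data.frame(
  NameA = c("189_1C2", "189_1C3", "189_1C3", "189_1C4", "189_1C4", "189_1C4"),
  NameB = c("189_1C1", "189_1C1", "189_1C2", "189_1C1", "189_1C2", "189_1C3"),
  Dist  = c(0, 0.017, 0.017, 0, 0, 0.017),
  stringsAsFactors = FALSE)

# Square matrix with one row and column per object, zero on the diagonal.
ids <- sort(unique(c(pairs$NameA, pairs$NameB)))
m <- matrix(0, length(ids), length(ids), dimnames = list(ids, ids))

# Fill both triangles so the matrix is symmetric.
m[cbind(pairs$NameA, pairs$NameB)] <- pairs$Dist
m[cbind(pairs$NameB, pairs$NameA)] <- pairs$Dist

hc <- hclust(as.dist(m), method = "average")
plot(hc)

# Groups obtained by cutting the tree at a small height.
groups <- cutree(hc, h = 0.01)
print(groups)
```

Indexing the matrix with a two-column character matrix matches each pair of names against the row and column names, which avoids an explicit loop.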
[R] tabulation on dataframe question
I have a data frame with data similar to this:

    NameA GrpA  NameB GrpB  Dist
    A     Alpha B     Alpha 0.2
    A     Alpha C     Beta  0.2
    A     Alpha D     Beta  0.4
    B     Alpha C     Beta  0.2
    B     Alpha D     Beta  0.1
    C     Beta  D     Beta  0.3

Dist is a distance measure between two entities. The table displays all-to-all distances, but the distance between two entities only appears once.

What I would like to get is a table where I get a count of entities per group where the distance satisfies a certain condition (equal to zero, for instance).

In this case, if the requirement was Dist == 0.2:

          Alpha Beta
    Alpha     1    2
    Beta      2    0

This resulting table would be symmetrical.

I hope I am able to convey what I would like, and TIA for your help!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
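A minimal sketch of the symmetric count table, using the example data from the post: subset on the condition, cross-tabulate the two group columns with a shared set of levels, then add the transpose while keeping the diagonal counted once. The object names are made up.

```r
d <- data.frame(
  NameA = c("A", "A", "A", "B", "B", "C"),
  GrpA  = c("Alpha", "Alpha", "Alpha", "Alpha", "Alpha", "Beta"),
  NameB = c("B", "C", "D", "C", "D", "D"),
  GrpB  = c("Alpha", "Beta", "Beta", "Beta", "Beta", "Beta"),
  Dist  = c(0.2, 0.2, 0.4, 0.2, 0.1, 0.3),
  stringsAsFactors = FALSE)

# Shared levels so the table is square even if a group is missing on one side.
grps <- sort(unique(c(d$GrpA, d$GrpB)))

# Rows that satisfy the condition.
hit <- d[d$Dist == 0.2, ]

tab <- table(factor(hit$GrpA, levels = grps),
             factor(hit$GrpB, levels = grps))

# Mirror the off-diagonal counts; within-group pairs stay counted once.
sym <- tab + t(tab)
diag(sym) <- diag(tab)
print(sym)
```

For computed (rather than typed-in) distances, a tolerance such as `abs(d$Dist - 0.2) < 1e-8` is safer than `==`.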
[R] extracting rows from dataframe that match a vector
Hi!

I have a large dataframe that I want to extract a subset from. This subset has a certain column value that matches elements in a vector I have defined. So, my question is how do I get the rows that match one of the elements in the vector.

Example:

    a = c(1:5)
    b = letters[1:10]
    df = data.frame(ind = a, letrs = b)
    > df
       ind letrs
    1    1     a
    2    2     b
    3    3     c
    4    4     d
    5    5     e
    6    1     f
    7    2     g
    8    3     h
    9    4     i
    10   5     j
    >

    # Now I want to extract all of the rows where ind == 2, 4 or 5.
    # This would be rows 2, 4, 5, 7, 9 and 10
    subgr = c(2,4,5)

My most natural inclination would be to do

    df[df$ind == subgr,]

However, this does not work:

    > df[df$ind == subgr,]
      ind letrs
    7   2     g
    Warning message:
    In df$ind == subgr :
      longer object length is not a multiple of shorter object length
    >

So, which part of this is it that I have misunderstood?

Thanks for your help btw!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
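The `==` operator compares element by element and recycles the shorter vector, which is why only one row came back. For membership tests, `%in%` is the usual tool. A minimal sketch with the example from the post (the name `wanted` is made up):

```r
df <- data.frame(ind = rep(1:5, 2), letrs = letters[1:10])
subgr <- c(2, 4, 5)

# TRUE for every row whose ind is one of the values in subgr.
keep <- df$ind %in% subgr

wanted <- df[keep, ]
print(wanted)
```

`subset(df, ind %in% subgr)` is an equivalent spelling.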
[R] accessing the indices of outliers in a data frame boxplot
I have a data frame containing columns which are factors. I use this to make boxplots for the data, with one box per factor level. I would now like to get at the data in the data frame which corresponds to the outliers.

I have so far found $out, which gives "the values of any data points which lie beyond the extremes of the whiskers", but I haven't found anything which will let me get at the indices in the original data frame for these outliers. I think there might be a chance that I could simply compare the values I am plotting from my data frame with the values for the whiskers and use that as a criterion, but I am uncertain of how to do this without doing it manually. The factor I am plotting against contains 17 levels, and I'd thus like to see if there is a somewhat more general solution available.

Thanks for your help!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
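One way to recover the row indices is to flag, within each factor level, the values that `boxplot.stats()` reports in `$out`, using `ave()` to keep the flags aligned with the original rows. The data below are simulated, with two outliers planted at known rows; all names are made up.

```r
set.seed(7)  # arbitrary seed for a repeatable example
dat <- data.frame(grp = factor(rep(c("g1", "g2", "g3"), each = 20)),
                  val = rnorm(60))
dat$val[c(5, 45)] <- c(10, -10)   # plant two obvious outliers

# Within each group, mark values that boxplot.stats() puts beyond the whiskers.
# ave() returns the flags in the original row order (as 0/1).
flag <- ave(dat$val, dat$grp,
            FUN = function(v) v %in% boxplot.stats(v)$out)

out_rows <- which(as.logical(flag))   # indices into the original data frame
print(dat[out_rows, ])
```

`boxplot()` uses the same statistics per group, so these rows should correspond to the points it draws individually (with the default `range = 1.5`).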
[R] contingency table on data frame
I am sorry if this is a FAQ or in a tutorial somewhere, but I am unable to solve this one. What I am looking for is a count of how many different categories (numbers in this case) appear for a given factor level. Example:

    > l <- c("Yes", "No", "Perhaps")
    > x <- factor( sample(l, 10, replace=T), levels=l )
    > m <- c(1:5)
    > y <- factor( sample(m, 10, replace=T), levels=m )
    > z = c(1:10)
    > my_df = data.frame("Z" = z, "Y"= y, "X" = x)
    > my_df
        Z Y       X
    1   1 4     Yes
    2   2 1      No
    3   3 2 Perhaps
    4   4 3     Yes
    5   5 4      No
    6   6 5      No
    7   7 1     Yes
    8   8 4 Perhaps
    9   9 4     Yes
    10 10 2 Perhaps
    >

I am now looking for a table that will give me this:

    Yes     3   # Yes has these ys: 4,3,1,4, two are the same, ergo 3
    No      3   # No has these ys: 1,4,5
    Perhaps 2   # Perhaps has these ys: 2,4,2

My data frame has lots of other columns too, but I only want this information out.

Thank you for your help!

Karin
--
Karin Lagesen, PhD student
[EMAIL PROTECTED]
http://folk.uio.no/karinlag
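A short sketch of the distinct-value count with `tapply()`, using the data frame exactly as printed in the post (typed in rather than sampled, so the result is repeatable). The name `n_distinct` is made up.

```r
my_df <- data.frame(
  Z = 1:10,
  Y = factor(c(4, 1, 2, 3, 4, 5, 1, 4, 4, 2), levels = 1:5),
  X = factor(c("Yes", "No", "Perhaps", "Yes", "No", "No",
               "Yes", "Perhaps", "Yes", "Perhaps"),
             levels = c("Yes", "No", "Perhaps")))

# For each level of X, count how many different Y values occur.
n_distinct <- tapply(my_df$Y, my_df$X, function(v) length(unique(v)))
print(n_distinct)
```

An equivalent cross-tabulation route is `rowSums(table(my_df$X, my_df$Y) > 0)`.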
[R] subsetting a data frame using string matching
Example data frame:

a = c("Alpha", "Beta", "Gamma", "Beeta", "Alpha", "beta")
b = c(1:6)
example = data.frame("Title" = a, "Vals" = b)
> example
  Title Vals
1 Alpha    1
2  Beta    2
3 Gamma    3
4 Beeta    4
5 Alpha    5
6  beta    6
>

I would like to be able to get a new data frame from this data frame containing only rows that match a certain string. In this case it could for instance be the string "eta". I have tried various ways of using agrep and grep, but so far I have not found anything that worked. Thank you in advance for your help! Karin -- Karin Lagesen, PhD student [EMAIL PROTECTED] http://folk.uio.no/karinlag
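A small sketch of how this could be done with `grepl()`, which returns one TRUE/FALSE per element and can therefore be used directly as a row index. The explicit `as.character()` is there only so the same line behaves identically whether Title is stored as a factor or as a character vector.

```r
a <- c("Alpha", "Beta", "Gamma", "Beeta", "Alpha", "beta")
b <- 1:6
example <- data.frame(Title = a, Vals = b)

# Keep the rows whose Title contains the substring "eta".
hits <- example[grepl("eta", as.character(example$Title)), ]
hits
```

`grep()` would work as well, since it returns the matching positions: `example[grep("eta", as.character(example$Title)), ]`.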
[R] creating summary functions for data frame
I have a data frame that looks like this:

> gctablechromonly[1:5,]
     refseq geometry gccontent X60_origin X60_terminus  length  kingdom
1 NC_009484      cir    0.6799        179       773000 3389227 Bacteria
2 NC_009484      cir    0.6799        179       773000 3389227 Bacteria
3 NC_009484      cir    0.6799        179       773000 3389227 Bacteria
4 NC_009484      cir    0.6799        179       773000 3389227 Bacteria
5 NC_009484      cir    0.6799        179       773000 3389227 Bacteria
                  grp feature gene begin dir gc_content replicor LEADLAG
1 Alphaproteobacteria     CDS  CDS   261   +   0.654244    RIGHT    LEAD
2 Alphaproteobacteria     CDS  CDS  1737   -   0.651408    RIGHT     LAG
3 Alphaproteobacteria     CDS  CDS  2902   +   0.607843    RIGHT    LEAD
4 Alphaproteobacteria     CDS  CDS  3693   +   0.617647    RIGHT    LEAD
5 Alphaproteobacteria     CDS  CDS  4227   +   0.699208    RIGHT    LEAD
>

About half of these columns are factors, for instance refseq, kingdom, grp and feature. Now, I have seen that I can do

by(gctablechromonly, gctablechromonly$feature, summary)

to get useful information. However, I am wondering how I can write my own functions to get what I'd like. For instance, how could I get a table with grp as rows down the side, feature along the top, and a count of each kind of feature within each grp? I realize that this is probably pretty easy to do, but I do not know enough R yet to know which words to look for in the mail archives...:) TIA, Karin -- Karin Lagesen, PhD student [EMAIL PROTECTED] http://folk.uio.no/karinlag
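For the grp-by-feature count specifically, `table()` may already be enough. The sketch below uses a tiny made-up data frame `toy`; the values "Firmicutes", "rRNA" and "tRNA" are invented for illustration and do not come from the post. The second call shows how `tapply()` accepts any summary function, which is one route to writing your own per-cell summaries.

```r
# Toy stand-in for the real table; group and feature values are hypothetical.
toy <- data.frame(
  grp        = c("Alphaproteobacteria", "Alphaproteobacteria",
                 "Alphaproteobacteria", "Firmicutes", "Firmicutes"),
  feature    = c("CDS", "CDS", "rRNA", "CDS", "tRNA"),
  gc_content = c(0.65, 0.61, 0.55, 0.40, 0.58)
)

# Rows are grp, columns are feature, cells are counts.
counts <- table(grp = toy$grp, feature = toy$feature)
counts

# Any function can be applied per grp/feature cell, here the mean gc_content.
# Combinations that never occur come back as NA.
mean_gc <- tapply(toy$gc_content, list(toy$grp, toy$feature), mean)
mean_gc
```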
[R] "continuous" boxplot?
I have two vectors x and y, which I would like to plot against each other. I am also displaying other data in this plot. However, I have about 1 million points to plot, and just plotting x against y is not very informative. What I'd like to do is a sort of continuous box plot. My x values go from -1 to 1 and my y values from 0 to 1, so I'd like to plot the median and quantiles, and possibly also all of the outliers somehow. Are there any facilities in R for doing something like this, or would I need to do this the hard-coded way? Thank you very much for your help! Karin -- Karin Lagesen, PhD student [EMAIL PROTECTED] http://folk.uio.no/karinlag
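One way to approximate a "continuous" boxplot is to cut x into narrow bins and draw the quartiles of y per bin as lines. The sketch below is only an illustration: the data are simulated, and the bin width of 0.1 is an arbitrary assumption that would need tuning for real data.

```r
set.seed(1)
x <- runif(10000, -1, 1)      # simulated stand-in for the real x
y <- rbeta(10000, 2, 5)       # simulated stand-in for y in [0, 1]

# 20 equal-width bins over [-1, 1].
breaks <- seq(-1, 1, length.out = 21)
bins   <- cut(x, breaks, include.lowest = TRUE)

# Quartiles of y within each bin: a 3 x 20 matrix (rows 25%, 50%, 75%).
q <- sapply(split(y, bins), quantile, probs = c(0.25, 0.5, 0.75))

# Plot the median as a solid line and the quartiles as dashed lines,
# positioned at the bin midpoints.
mids <- head(breaks, -1) + diff(breaks) / 2
matplot(mids, t(q), type = "l", lty = c(2, 1, 2), col = 1,
        xlab = "x", ylab = "y")
```

If actual boxes with outlier points are wanted instead of lines, `boxplot(y ~ bins)` on the same binning draws one box per bin.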
[R] How to choose name for package during install (was:Re: problem loading hexbin associated package colorspace)
Prof Brian Ripley <[EMAIL PROTECTED]> writes:

> Where does this 'hexbin' package come from? The one I have installed
> (and the only one I found) is from BioC, and that does not depend on
> colorspace:
>
> Description:
>
> Package: hexbin
> Version: 1.10.0
> Date: 2006-09-28
> Depends: R (>= 2.0), methods, grid, lattice

I have now discovered that I had an old hexbin version installed, one that did require colorspace. I am now trying to install the 1.10 version; however, I cannot get it properly loaded since library grabs the first one in the list. So now for my next question: during R CMD INSTALL, how do I specify that a package should not just be named "hexbin" but for instance "hexbin_1.10", so that I can actually tell library to get the correct one? (This seems to me to be the way the library help file tells me that I should solve this problem.)

> That said, something is wrong with your installation of colorspace, so
> I suggest you reinstall it.

Reinstalled, but it still does not load properly:

alanine[15:01]:~/work/rna_comparison/scripts/rpackages> cat colorspace/DESCRIPTION
Package: colorspace
Version: 0.95
Date: 2006-11-16
Title: Colorspace Manipulation
Author: Ross Ihaka <[EMAIL PROTECTED]>
Maintainer: Ross Ihaka <[EMAIL PROTECTED]>
Depends: R (>= 2.0.0), methods
Description: Carries out mapping between assorted color spaces.
License: BSD
URL: http://www.r-project.org
LazyLoad: yes
Packaged: Thu Nov 16 11:47:26 2006; ihaka
Built: R 2.5.1; x86_64-unknown-linux-gnu; 2007-09-27 12:58:16; unix
alanine[15:02]:~/work/rna_comparison/scripts/rpackages>

> library(colorspace)
Error in loadNamespace(package, c(which.lib.loc, lib.loc), keep.source = keep.source) :
        in 'colorspace' methods for export not found: [, coords, plot
Error: package/namespace load failed for 'colorspace'
>

Karin -- Karin Lagesen, PhD student [EMAIL PROTECTED] http://folk.uio.no/karinlag
[R] problem loading hexbin associated package colorspace
I have lots of data that I need to display, and I think hexbin would be good for it. However, I cannot load one of the required packages associated with the hexbin package:

> library(hexbin)
Loading required package: colorspace
Error in loadNamespace(package, c(which.lib.loc, lib.loc), keep.source = keep.source) :
        in 'colorspace' methods for export not found: [, coords, plot
Error: package 'colorspace' could not be loaded
> library(colorspace)
Error in loadNamespace(package, c(which.lib.loc, lib.loc), keep.source = keep.source) :
        in 'colorspace' methods for export not found: [, coords, plot
Error: package/namespace load failed for 'colorspace'
> sessionInfo()
R version 2.5.1 (2007-06-27)
x86_64-unknown-linux-gnu

locale:
C

attached base packages:
[1] "grid"      "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "methods"   "base"

other attached packages:
 lattice
"0.14-9"
>

The colorspace package is version 0.95. Is this an error with my system, this code, or something else? Thanks for having this list btw...:) Karin -- Karin Lagesen, PhD student [EMAIL PROTECTED] http://folk.uio.no/karinlag
[R] Am I misunderstanding the ifelse construction?
I have a function like this:

changedir <- function(dataframe) {
  dir <- dataframe$dir
  gc_content <- dataframe$gc_content
  d <- ifelse(dir == "-", gc_content <- -gc_content,gc_content <- gc_content)
  return(d)
}

The goal of this function is to be able to input a data frame like this:

> lala
   dir gc_content
1    +        0.5
2    -        0.5
3    +        0.5
4    -        0.5
5    +        0.5
6    -        0.5
7    +        0.5
8    -        0.5
9    +        0.5
10   -        0.5
11   +        0.5
12   -        0.5
13   +        0.5
14   -        0.5
15   +        0.5
16   -        0.5
17   +        0.5
18   -        0.5
19   +        0.5
20   -        0.5
>

and change the sign of the value of the gc_content field if the corresponding dir field is negative. However, when I run this through the changedir function, all of the gc_contents become negative. Am I misunderstanding how to use the ifelse construct? And in that case, how should I go about doing this in a different way? Thank you very much in advance for your help, and I hope that my question is not too banal! Karin -- Karin Lagesen, PhD student [EMAIL PROTECTED] http://folk.uio.no/karinlag
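A likely explanation, with a sketch of a fix: `ifelse(test, yes, no)` expects two plain value vectors, not assignments. In the posted function the `yes` argument `gc_content <- -gc_content` is evaluated first and overwrites `gc_content` inside the function with the negated vector; the `no` argument then reads that already-negated vector, so both branches supply negative values. Passing the two candidate vectors directly avoids this. The data frame below is rebuilt from the printed `lala`.

```r
# Flip the sign of gc_content wherever dir is "-"; leave other rows alone.
changedir <- function(dataframe) {
  ifelse(dataframe$dir == "-", -dataframe$gc_content, dataframe$gc_content)
}

# The example input from the post: alternating "+" / "-", all values 0.5.
lala <- data.frame(dir = rep(c("+", "-"), 10), gc_content = 0.5)

flipped <- changedir(lala)
flipped
```

The same result can be had without `ifelse()` by multiplying with a sign vector, for instance `lala$gc_content * ifelse(lala$dir == "-", -1, 1)`.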
[R] function on factors - how best to proceed
Sorry about this one being long, and I apologise beforehand if there is something obvious here that I have missed. I am new to creating my own functions in R, and I am uncertain of how they work. I have a data set that I have read into a data frame:

> gctable[1:5,]
     refseq geometry X60_origin X60_terminus  length  kingdom
1 NC_009484      cir        179       773000 3389227 Bacteria
2 NC_009484      cir        179       773000 3389227 Bacteria
3 NC_009484      cir        179       773000 3389227 Bacteria
4 NC_009484      cir        179       773000 3389227 Bacteria
5 NC_009484      cir        179       773000 3389227 Bacteria
                  grp feature gene begin dir gc_content replicor LEADLAG
1 Alphaproteobacteria     CDS  CDS   261   +   0.654244    RIGHT    LEAD
2 Alphaproteobacteria     CDS  CDS  1737   -   0.651408    RIGHT     LAG
3 Alphaproteobacteria     CDS  CDS  2902   +   0.607843    RIGHT    LEAD
4 Alphaproteobacteria     CDS  CDS  3693   +   0.617647    RIGHT    LEAD
5 Alphaproteobacteria     CDS  CDS  4227   +   0.699208    RIGHT    LEAD
>

Most of these columns are factors. Now, I have a function that I would like to employ on this data frame. Right now I cannot get it to work, and that seems to be due to the columns in the data frame being factors. I tested it with a data frame created from vectors, and it worked fine.
The function:

percentdistance <- function(origin, terminus, length, begin, replicor){
  print(c(origin, terminus, length, begin, repl))
  d = 0
  if (terminus>origin) {
    if(replicor=="LEFT") {
      d = -((origin-begin)%%length)
    } else {
      d = (begin-origin)
    }
  } else {
    if (replicor=="LEFT") {
      d=(origin-begin)
    } else{
      d = -((begin-origin)%%length)
    }
  }
  d/length*2
}

The error I get:

> percentdistance(gctable$X60_origin, gctable$X60_terminus, gctable$length,
                  gctable$begin, gctable$replicor)
[1] 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87
[19] 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87
[37] 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87
[55] 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87
[73] 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87
[91] 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87
[109] 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87
[127] 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87
.
[99919] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[99937] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[99955] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[99973] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[1] 2 2 2 2 2 2 2 2 2
[ reached getOption("max.print") -- omitted 8526091 entries ]]
Error in if (terminus > origin) { : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: > not meaningful for factors in: Ops.factor(terminus, origin)
2: the condition has length > 1 and only the first element will be used in: if (terminus > origin) {
>

This worked nicely when the inputs were columns from a data frame created from vectors. I have also tried the different apply functions, although I am uncertain of which one would be appropriate here. I would like to use this function to create a new data frame which would look something like this:

new_frame = (gctable$feature, gctable$gene, gctable$kingdom, gctable$grp, gctable$gc_content, percentdistance(gctable))

I am uncertain of how to proceed.
Should I deconstruct the data frame within the function, or should I get just the numbers out of the factors and input them into the function? Or is my solution way off from how things are done in R? Thank you very much for your help! Karin -- Karin Lagesen, PhD student [EMAIL PROTECTED] http://folk.uio.no/karinlag
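Two things seem to be going on in the posted function: the numeric columns arrive as factors, so `>` is not meaningful, and `if` looks at a single condition while whole columns are passed in. A sketch of a vectorized variant follows, under these assumptions: the helper `fac2num` is introduced here purely for illustration, the branch logic is copied from the posted function, and the two-row `toy` frame only imitates the printed rows. The debugging `print()` is left out (it also referred to `repl`, which is not one of the function's arguments).

```r
# Turn a factor whose labels are numbers into those numbers.
# (as.numeric() alone would return the internal level codes instead.)
fac2num <- function(f) if (is.factor(f)) as.numeric(as.character(f)) else f

# Vectorized version: every argument is a whole column, and ifelse()
# picks the right branch element by element.
percentdistance <- function(origin, terminus, length, begin, replicor) {
  origin   <- fac2num(origin)
  terminus <- fac2num(terminus)
  length   <- fac2num(length)
  begin    <- fac2num(begin)
  left     <- as.character(replicor) == "LEFT"
  d <- ifelse(terminus > origin,
              ifelse(left, -((origin - begin) %% length), begin - origin),
              ifelse(left, origin - begin, -((begin - origin) %% length)))
  d / length * 2
}

# Toy input imitating the first two printed rows, deliberately stored as factors.
toy <- data.frame(
  X60_origin   = factor(c("179", "179")),
  X60_terminus = factor(c("773000", "773000")),
  length       = factor(c("3389227", "3389227")),
  begin        = factor(c("261", "1737")),
  replicor     = factor(c("RIGHT", "RIGHT"))
)

res <- percentdistance(toy$X60_origin, toy$X60_terminus, toy$length,
                       toy$begin, toy$replicor)
res
```

The result is an ordinary numeric vector, so it can be placed next to the other columns with `data.frame(feature = ..., gene = ..., pd = res)`.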
[R] add group to boxplot
I have two sets of data in data frames (different dimensions). Now, I am able to make boxplots for one of these. I am using a formula, which gives me three boxplots which are automatically placed at at = c(1:3), since there are three groups. However, I would now like to add data from the other data frame, at position 4 in the box plot. Is there any way of telling the first boxplot command that this should be allowed? Hopefully an example that expresses what I need:

> data(InsectSprays)
> boxplot(count~spray, data = InsectSprays)
> Aspray = subset(InsectSprays, spray == "A")
> Aspray[] = lapply(Aspray, function(x) if (is.factor(x)) factor(x) else x)

Now, I want to add Aspray as a new boxplot at the end of the existing plot. What do I do then? Karin -- Karin Lagesen, PhD student [EMAIL PROTECTED] http://folk.uio.no/karinlag
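Two possible approaches, sketched with the InsectSprays example from the post (which has six groups, so the extra box lands at position 7 rather than 4). The label "A (again)" is made up for illustration. Option 1 avoids the problem by collecting all groups in one list before plotting; option 2 reserves space with `xlim` and then adds the extra box with `add = TRUE` and `at`.

```r
data(InsectSprays)

# The single-group subset from the post, with unused factor levels dropped.
Aspray <- droplevels(subset(InsectSprays, spray == "A"))

# Option 1: build one list of groups and draw everything in a single call.
groups <- c(split(InsectSprays$count, InsectSprays$spray),
            list("A (again)" = Aspray$count))
bp <- boxplot(groups)

# Option 2: leave room on the x axis, then add the extra box afterwards.
n <- nlevels(InsectSprays$spray)
boxplot(count ~ spray, data = InsectSprays, xlim = c(0.5, n + 1.5))
boxplot(Aspray$count, add = TRUE, at = n + 1, axes = FALSE)
axis(1, at = n + 1, labels = "A (again)")
```

Option 1 is usually less fiddly when the two data frames have different dimensions, because a list does not care how long each element is.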