[R] stacked and dodged bar graph ggplot

2016-12-30 Thread Robert Lynch
I have some census data with race and ethnicity for various towns.  I am
trying to make a stacked bar graph where all the race data is in one
stacked bar, and all the ethnicity data is in another.

Below is a minimal reproducible sample.

library("ggplot2")

Demog <- data.frame(
  source = c(rep("Davis", 4), rep("Dixon", 4), rep("Winters", 4)),
  group  = c("Asian / Pacific Islander", "Caucasian", "Latinx", "Not Latinx",
             "African American", "Native American", "Latinx", "Not Latinx",
             "Mixed race", "Other", "Latinx", "Not Latinx"),
  number = c(14491, 42571, 8172, 57450, 562, 184, 7426, 10952,
             332, 1488, 3469, 3155),
  field  = rep(c(rep("race", 2), rep("ethnicity", 2)), 3))

# Levels must match the strings in Demog$group exactly; anything else becomes NA.
Demog$race <- factor(Demog$group, levels = c("Asian / Pacific Islander",
  "Caucasian", "African American", "Native American", "Mixed race", "Other"))
Demog$ethn <- factor(Demog$group, levels = c("Latinx", "Not Latinx"))
Demog$location <- factor(Demog$source, levels = c("Dixon", "Winters", "Davis"))

Demog.bar1 <- ggplot(data = Demog, aes(x = location, y = number, fill = race)) +
  theme_bw() + geom_bar(stat = "identity", position = "stack") + coord_flip()

Demog.bar2 <- ggplot(data = Demog, aes(x = location, y = number, fill = ethn)) +
  theme_bw() + geom_bar(stat = "identity", position = "stack") + coord_flip()

print(Demog.bar1)
print(Demog.bar2)
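
One way to get both stacks next to each other for every town is to put the field (race vs. ethnicity) on the axis and facet by town. This is only a sketch: the names demog and p are illustrative, and faceting is swapped in here for a true dodge, since a single geom_bar() layer takes only one position adjustment.

```r
library(ggplot2)

# Same census counts as in the post, rebuilt here so the block runs on its own.
demog <- data.frame(
  location = factor(rep(c("Davis", "Dixon", "Winters"), each = 4),
                    levels = c("Dixon", "Winters", "Davis")),
  group = c("Asian / Pacific Islander", "Caucasian", "Latinx", "Not Latinx",
            "African American", "Native American", "Latinx", "Not Latinx",
            "Mixed race", "Other", "Latinx", "Not Latinx"),
  number = c(14491, 42571, 8172, 57450, 562, 184, 7426, 10952,
             332, 1488, 3469, 3155),
  field = rep(c("race", "race", "ethnicity", "ethnicity"), 3)
)

# One stacked bar per field (race, ethnicity), one panel per town.
p <- ggplot(demog, aes(x = field, y = number, fill = group)) +
  geom_bar(stat = "identity", position = "stack") +
  facet_grid(location ~ .) +
  coord_flip() +
  theme_bw()
print(p)
```

Each panel then shows a race bar and an ethnicity bar for one town, with a single shared legend.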




Much thanks,
Robert

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] anova.lme

2014-07-21 Thread Robert Lynch
I would like to know the sum of squares for each term in my model. I used
the following call to fit the model

fit.courseCross <- lme(fixed = zGrade ~ Rep + ISE + P7APrior + Female + White +
                         HSGPA + MATH + Years + Course + Course*P7APrior,
                       random = ~ 1 | SID,
                       data = Master.complete[Master.complete$Course != "P7A", ])

and called an anova on it and get:

anova(fit.courseCross)
                numDF denDF   F-value p-value
(Intercept)         1 58161 1559.6968  <.0001
Rep                 1 58161  520.7263  <.0001
ISE                 1  6266   21.3713  <.0001
P7APrior            2 58161  358.4827  <.0001
Female              1  6266   89.2614  <.0001
White               1  6266  235.9984  <.0001
HSGPA               1  6266 1156.4116  <.0001
MATH                1  6266 1036.1354  <.0001
Years               1 58161  407.6096  <.0001
Course             12 58161   68.9875  <.0001
P7APrior:Course    24 58161   10.2464  <.0001

The documentation for anova.lme says:

When only one fitted model object is present, a data frame with the sums of
squares, numerator degrees of freedom, denominator degrees of freedom,
F-values, and P-values for Wald tests for the terms in the model (when
Terms and L are NULL), a combination of model terms (when Terms in not
NULL), or linear combinations of the model coefficients (when L is not
NULL).

Noticeably absent from the output above are the sums of squares.

How do I get them?
Robert



[R] CI for nlme predictions

2014-07-11 Thread Robert Lynch
I am running a mixed effects model with random intercepts

fit.courseCross <- lme(fixed = zGrade ~ Rep + ISE + P7APrior + Female + White +
                         HSGPA + MATH + Years + Course + Course*P7APrior,
                       random = ~ 1 | SID,
                       data = Master.complete[Master.complete$Course != "P7A", ])
where all variables are factors except for HSGPA, MATH and Years.

I noticed that predict.lm has an option for standard errors, but
predict.lme does not.  I understand that this might be because there is a
difference between SEs that are conditioned on the random effects and
SEs that are not.

I have looked at this Stack Overflow question
http://stackoverflow.com/questions/14358811/extract-prediction-band-from-lme-fit
but do not understand what is being done.

I would like to show the predicted fit of zGrade vs Years with a
confidence interval, a la ggplot's geom_smooth.  The particular intercept
does not matter (I don't care what the intercept is, though given a choice
I'd prefer grand mean centered).  I would be happy with either conditional
or unconditional CIs.
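
A minimal sketch of the idea behind the linked answer, using the Orthodont data that ships with nlme rather than the grade data. It builds the fixed-effects design matrix for a grid of new points, multiplies it by the fixed-effect estimates, and takes standard errors from the fixed-effect covariance matrix. The resulting band ignores the random intercepts (a population-level interval), the 1.96 multiplier is a normal approximation, and the names newdat, X and se are illustrative.

```r
library(nlme)

# Random-intercept model on a built-in data set (stand-in for the grade model).
fit <- lme(distance ~ age + Sex, random = ~ 1 | Subject, data = Orthodont)

# Grid of predictor values at which to draw the fitted line and band.
newdat <- expand.grid(age = seq(8, 14, by = 2), Sex = levels(Orthodont$Sex))

# Design matrix for the fixed part only; columns line up with fixef(fit).
X <- model.matrix(~ age + Sex, data = newdat)

newdat$fit <- as.vector(X %*% fixef(fit))
se <- sqrt(diag(X %*% vcov(fit) %*% t(X)))   # SE of the fixed-effect prediction
newdat$lower <- newdat$fit - 1.96 * se
newdat$upper <- newdat$fit + 1.96 * se
newdat
```

The lower and upper columns can then be drawn with geom_ribbon() around a geom_line() of fit to get a geom_smooth-like picture.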

Robert



Re: [R] help with the nested anova formulas

2013-12-05 Thread Robert Lynch
I am modeling grade as a function of membership in various cohorts.  There
are four cohorts (NONE, ISE07, ISE08, ISE09) and two types of cohorts,
coded as ISE = TRUE (ISE0#) or FALSE (NONE).  There is clear collinearity,
but that is to be expected.

Running the following code

CutOff <- 0
# the comparison operator was lost in the archive; ">=" is assumed here
fit.base <- lme(fixed = zGrade ~ Rep + COHORT/ISE + P7APrior + Female +
                  White + HSGPA + MATH + AP_TOTAL + Years + EOP + Course,
                random = ~ 1 | SID,
                data = share[share$GRADE >= CutOff, ])

I get the following error

Error in MEEM(object, conLin, control$niterEM) :
  Singularity in backsolve at level 0, block 1

but if I take out the /ISE I get no error; similarly if I take out the
COHORT/.

I want to test for the effects of the different cohorts within the ISE
subset and across ISE and NONE.

I can send the data (the whole is too large) if you wish.

 Robert



[R] help with the nested anova formulas

2013-12-05 Thread Robert Lynch
I am modeling grade as a function of membership in various cohorts.  There
are four cohorts (NONE, ISE07, ISE08, ISE09) and two types of cohorts,
coded as ISE = TRUE (ISE0#) or FALSE (NONE).  There is clear collinearity,
but that is to be expected.

Running the following code

CutOff <- 0
# the comparison operator was lost in the archive; ">=" is assumed here
fit.base <- lme(fixed = zGrade ~ Rep + COHORT/ISE + P7APrior + Female +
                  White + HSGPA + MATH + AP_TOTAL + Years + EOP + Course,
                random = ~ 1 | SID,
                data = share[share$GRADE >= CutOff, ])

I get the following error

Error in MEEM(object, conLin, control$niterEM) :
  Singularity in backsolve at level 0, block 1

but if I take out the /ISE I get no error; similarly if I take out the
COHORT/.

I want to test for the effects of the different cohorts within the ISE
subset and across ISE and NONE.
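
The singularity can be reproduced on a toy design, which may help to see where it comes from. In this sketch (the vectors cohort and ise are made up to mirror the coding described above), ISE is fully determined by the cohort, so the nested term adds columns that are linear combinations of the cohort columns and the design matrix loses rank.

```r
# Toy version of the coding described above: ISE is TRUE for every ISE0# cohort.
cohort <- factor(rep(c("NONE", "ISE07", "ISE08", "ISE09"), each = 3))
ise <- cohort != "NONE"

# cohort/ise expands to cohort + cohort:ise
X <- model.matrix(~ cohort / ise)

# More columns than the rank means some coefficients cannot be estimated.
c(columns = ncol(X), rank = qr(X)$rank)
```

lm() would report NA for the aliased coefficients, whereas lme() stops with the backsolve error, so one of the two terms (or a hand-built set of contrasts on COHORT) has to carry the comparison.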

Robert


[R] getting p-value for comparing two GAMs from mgcv

2013-10-30 Thread Robert Lynch
I am trying to compare two different GAM fits.
I have something like
Course.bam20 <- bam(zGrade ~ Rep + ISE + White + Female + Years + AP_TOTAL +
    MATH + HSGPA + EOP + factor(P7APrior, ordered = FALSE) + s(Yfrm7A, k = 20),
    data = Course, na.action = na.exclude, samfrac = 0.1)

Course.bam4 <- bam(zGrade ~ Rep + ISE + White + Female + Years + AP_TOTAL +
    MATH + HSGPA + EOP + factor(P7APrior, ordered = FALSE) + s(Yfrm7A, k = 4),
    data = Course, na.action = na.exclude, samfrac = 0.1)

anova(Course.bam20, Course.bam4)

Model 1: zGrade ~ Rep + ISE + White + Female + Years + AP_TOTAL + MATH +
    HSGPA + EOP + factor(P7APrior, ordered = FALSE) + s(Yfrm7A,
    k = 20)
Model 2: zGrade ~ Rep + ISE + White + Female + Years + AP_TOTAL + MATH +
    HSGPA + EOP + factor(P7APrior, ordered = FALSE) + s(Yfrm7A,
    k = 4)
  Resid. Df Resid. Dev      Df Deviance
1    4721.7     1907.0
2    4724.5     1913.5 -2.7919  -6.4986

How can I get a p-value out of the anova?
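
A sketch of one option: anova() on mgcv fits accepts a test argument in the same way anova() on glm fits does, and with test = "F" the table gains an F statistic and a p-value column. The example uses gam() on data simulated with mgcv's gamSim() rather than the bam() fits above, and the p-value for comparing penalized smooths should be read as approximate.

```r
library(mgcv)

set.seed(1)
dat <- gamSim(1, n = 400, verbose = FALSE)   # simulated example data from mgcv

# Same smooth term with two different basis dimensions.
m20 <- gam(y ~ s(x2, k = 20), data = dat)
m4  <- gam(y ~ s(x2, k = 4),  data = dat)

# Asking for a test adds the F statistic and p-value columns.
tab <- anova(m4, m20, test = "F")
tab
```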



[R] Centering multi-level unordered factors

2013-10-07 Thread Robert Lynch
I have a question I am not even sure quite how to ask.

When R fits models with unordered categorical variables as predictors
(RHS of model) it automatically converts them into one fewer dichotomous
variables than there are levels.

For example, if I had levels(trait) = (A, B, C) it would automatically
recode to
  NewVar1 NewVar2
A       0       0
B       1       0
C       0       1

What I would like to know is whether there is a way that I can center these
categorical variables, and if so, how.

For continuous variables it is simple:
x <- x - mean(x)

For a single dichotomous variable it is not so hard:
gender <- gender - sum(gender)/length(gender)
where the gender values are (0, 1) or (-.5, .5), for example,
which would give gender coefficients in a model that would still reflect
the difference between the two genders, but the intercept and the other
coefficients would be for someone of average gender.

and it is that last part that I am unclear on for a multi (3 or more) level
factor.  How do you set up variables so that the *other* coefficients
reflect the average across the factor levels. Do I need two or three
centered variables? and is there a quick way to get at all those variables
if my factor has many levels, e.g. 14?
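
One possible way to do this for any number of levels, sketched with a made-up three-level factor: let model.matrix() build the k - 1 dummy columns and then subtract each column's mean. The names trait, D and Dc are illustrative.

```r
trait <- factor(c("A", "B", "B", "C", "C", "C"))

# k - 1 treatment-coded dummy columns (the intercept column is dropped).
D <- model.matrix(~ trait)[, -1]

# Center every dummy column at its own mean; no rescaling.
Dc <- scale(D, center = TRUE, scale = FALSE)

colMeans(Dc)   # effectively zero for each column
```

A factor with 14 levels gives 13 centered columns this way, and they can be passed to the model as a matrix term in place of the factor.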


Robert



[R] trouble with nlme: Error in MEEM() : Singularity in backsolve at level 0, block 1

2013-10-05 Thread Robert Lynch
I am trying to fit my data, attached, with the following model

CutOff <- 0

# the comparison operator was lost in the archive; ">=" is assumed here
fit.full <- lme(fixed = zGrade ~ Rep + ISE + Yfrm7A + Ufrm7A + Female + White +
                  HSGPA + MATH + AP_TOTAL + Years + Course +
                  Course*Rep + Course*ISE + Course*Yfrm7A + Course*Ufrm7A +
                  Course*Female + Course*White + Course*HSGPA + Course*MATH +
                  Course*AP_TOTAL + Course*Years,
                random = ~ 1 | SID,
                data = Master.complete[Master.complete$GRADE >= CutOff, ])

I get the following error
Error in MEEM(object, conLin, control$niterEM) :
  Singularity in backsolve at level 0, block 1

When I take out
+ Course*Yfrm7A + Course*Ufrm7A
and just use
fit.full <- lme(fixed = zGrade ~ Rep + ISE + Yfrm7A + Ufrm7A + Female + White +
                  HSGPA + MATH + AP_TOTAL + Years + Course +
                  Course*Rep + Course*ISE +
                  Course*Female + Course*White + Course*HSGPA + Course*MATH +
                  Course*AP_TOTAL + Course*Years,
                random = ~ 1 | SID,
                data = Master.complete[Master.complete$GRADE >= CutOff, ])
I don't get an error.

I think this is because when Course == "P7A", Yfrm7A == Ufrm7A == 0.  I am not
sure how to work around this.  Any suggestions would be welcome.



[R] re-coding variables

2013-10-01 Thread Robert Lynch
I am running a large mixed model, 65k entries on 11 fixed effects and one
random. One of the fixed effects is Course, a factor that takes on 14
different values
levels(Master.complete$Course)
 [1] "B101"  "B2A"   "B2B"   "B2C"   "C118A" "C118B" "C118C" "C2A"   "C2B"
[10] "C2C"   "N101"  "P7A"   "P7B"   "P7C"

and another, Yfrm7A, that is continuous
summary(Master.complete$Yfrm7A)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-5.25000 -0.75000  0.00000 -0.07688  0.50000  6.00000

but all the values are 0 (zero) when Course == "P7A"
summary(Master.complete$Yfrm7A[Master.complete$Course == "P7A"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0       0       0       0

Thus when I run the following mixed model I get no errors
fit.full <- lme(fixed = zGrade ~ Rep + ISE + Yfrm7A + Ufrm7A + Female + White +
                  HSGPA + MATH + AP_TOTAL + Years + Course +
                  Course*Rep + Course*Female + Course*White,
                random = ~ 1 | SID, data = Master.complete,
                na.action = na.exclude)

but if I add in a Course*Yfrm7A term
fit.full <- lme(fixed = zGrade ~ Rep + ISE + Yfrm7A + Ufrm7A + Female + White +
                  HSGPA + MATH + AP_TOTAL + Years + Course +
                  Course*Rep + Course*Female + Course*White + Course*Yfrm7A,
                random = ~ 1 | SID, data = Master.complete,
                na.action = na.exclude)

I get

Error in MEEM(object, conLin, control$niterEM) :
  Singularity in backsolve at level 0, block 1

I suspect I could solve this problem by ordering the levels of Course
so that P7A was the first level, and thus the one that the others were
compared to, but I am unclear on how to do so.
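
For the reordering itself, relevel() moves a chosen level to the front. The sketch below uses a made-up three-course example (course, x and rank_info are illustrative names) and also checks the rank of the design matrix before and after. In this toy case the matrix stays rank deficient after releveling, because the x column equals the sum of the per-course x columns when x is zero throughout the reference course, so releveling alone may not remove the singularity.

```r
course <- factor(rep(c("A", "B", "P7A"), each = 4))
x <- c(1, 2, 3, 4,  2, 1, 4, 3,  0, 0, 0, 0)   # all zero within P7A

# Make P7A the reference (first) level.
course_ref <- relevel(course, ref = "P7A")
levels(course_ref)

# Number of columns vs. rank of the design matrix for f * x.
rank_info <- function(f) {
  X <- model.matrix(~ f * x)
  c(columns = ncol(X), rank = qr(X)$rank)
}

rank_info(course)       # P7A:x column is all zero
rank_info(course_ref)   # x is now the sum of the A:x and B:x columns
```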

Robert



[R] ggplot legend formatting

2013-09-19 Thread Robert Lynch
I am having trouble getting my legend to format correctly in ggplot2. A
full description and pictures are in the ggplot2 Google group
(https://groups.google.com/forum/?hl=en#!topic/ggplot2/LSarpgmSG8k),
but the short description is that in guides(fill = guide_legend(nrow =
3), bycol = TRUE), changing the call to have byrow = TRUE does not change
the plot.
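
A sketch of what may be going on: byrow is an argument of guide_legend(), not of guides(), so it only takes effect when written inside the guide_legend() call. The data frame toy below is made up, and the cohort levels are ordered so that a row-wise fill pairs each ISE cohort with its comparison group.

```r
library(ggplot2)

# Made-up data; the level order controls the order of the legend keys.
toy <- data.frame(
  course = rep(c("2A", "2B"), each = 6),
  cohort = factor(rep(c("ISE07", "CMP07", "ISE08", "CMP08", "ISE09", "CMP09"), 2),
                  levels = c("ISE07", "CMP07", "ISE08", "CMP08", "ISE09", "CMP09")),
  grade = c(3.1, 2.9, 3.3, 2.7, 3.0, 2.8, 2.5, 2.6, 3.4, 3.2, 2.2, 3.6)
)

p <- ggplot(toy, aes(course, grade, fill = cohort)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("red", "blue", "red3", "blue3", "red4", "blue4")) +
  # byrow goes inside guide_legend(): 3 rows, filled left to right
  guides(fill = guide_legend(nrow = 3, byrow = TRUE))
print(p)
```

With three rows filled by row, the first legend row reads ISE07, CMP07, the second ISE08, CMP08, and so on.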

Thanks
Robert



Re: [R] ggplot interactions

2013-09-11 Thread Robert Lynch
On Tue, Sep 10, 2013 at 11:33 PM, Robert Lynch robert.b.ly...@gmail.com wrote:

 I am sorry to ask what I am sure is a simple question, but I am stuck
 trying to figure out how different parts of ggplot2 calls interact.

 I am plotting using the following code

 ggplot(Chem.comp, aes(Course, GRADE)) +
   geom_boxplot(notch = TRUE, aes(fill = COHORT)) +
   labs(y = "Grade Points in class",
        x = "Chemistry 2 quarter") +
   # the end of the next call was truncated in the archive; the closing is assumed
   ggtitle(expression(atop("Comparison between ISE cohorts and Peers",
                           atop(italic("in Chem 2 classes"), "")))) +
   ylim(0, 4.) +
   scale_fill_manual(name = "ISE Cohorts \nComparison groups",
                     values = c("blue", "red", "blue3", "red3", "blue4", "red4")) +
   theme(plot.title = element_text(size = 25, face = "bold", colour = "black",
                                   vjust = -1)) +
   guides(fill = guide_legend(nrow = 3), byrow = TRUE)

 which gives Rplot.jpeg, which has the appropriate title, but the colors
 and the legend are wrong.

 If I comment out the ggtitle() and theme(), or just ggtitle(), I get
 Rplot01.jpeg, which has the right colors but no title and subtitle.  Also
 the legend is out of order.  The first row should read ISE07 CMP07, with 08
 on the second row and 09 on the third, with a red column and a blue column.
 Changing byrow = TRUE to bycol = TRUE does not change the plotting of the
 legend, nor does byrow = FALSE.

 I am asking for help with getting the title and subtitle to both show up
 at the same time as the appropriate colors for the different factor levels,
 and with getting the legend to render so that it looks sort of like
 ISE07  [red box ] [blue box ]  CMP07
 ISE08  [red3 box] [blue3 box]  CMP08
 ISE09  [red4 box] [blue4 box]  CMP09

 The exact colors are not important, but the vertical
 and horizontal alignment is.

 Thanks!
 Robert



Re: [R] ggplot interactions

2013-09-11 Thread Robert Lynch
 I am sorry to ask what I am sure is a simple question, but I am stuck
 trying to figure out how different parts of ggplot2 calls interact.

 I am plotting using the following code

 ggplot(Chem.comp, aes(Course, GRADE)) +
   geom_boxplot(notch = TRUE, aes(fill = COHORT)) +
   labs(y = "Grade Points in class",
        x = "Chemistry 2 quarter") +
   # the end of the next call was truncated in the archive; the closing is assumed
   ggtitle(expression(atop("Comparison between ISE cohorts and Peers",
                           atop(italic("in Chem 2 classes"), "")))) +
   ylim(0, 4.) +
   scale_fill_manual(name = "ISE Cohorts \nComparison groups",
                     values = c("blue", "red", "blue3", "red3", "blue4", "red4")) +
   theme(plot.title = element_text(size = 25, face = "bold", colour = "black",
                                   vjust = -1)) +
   guides(fill = guide_legend(nrow = 3), byrow = TRUE)

 which gives me a plot [available as a jpeg, not attached due to size
 limits] which has the appropriate title, but the colors and the legend
 are wrong. The colors cycle through R's standard colors, and the legend is
 1 column.

 If I comment out the ggtitle() and theme(), or just ggtitle(), I get a plot
 [available, but not attached due to size limits] which has the right colors
 and a mostly right legend, but no title and subtitle.  The legend is out of
 order.  The first row should read ISE07 CMP07, with 08 on the second row and
 09 on the third, with a red column and a blue column.  Changing byrow = TRUE
 to bycol = TRUE does not change the plotting of the legend, nor does
 byrow = FALSE.

 I am asking for help with getting the title and subtitle to both show up
 at the same time as the appropriate colors for the different factor levels,
 and with getting the legend to render so that it looks sort of like
 ISE07  [red box ] [blue box ]  CMP07
 ISE08  [red3 box] [blue3 box]  CMP08
 ISE09  [red4 box] [blue4 box]  CMP09
 The exact colors are not important, but the vertical
 and horizontal alignment is.

 Thanks!
 Robert




[R] finding both rows that are duplicated in a data frame

2013-09-07 Thread Robert Lynch
I have a data frame that looks like

id1 <- c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
id2 <- c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
GENDER <- sample(c("G-UNK","G-M","G-F"), 16, replace = TRUE)
ETH <- sample(c("E-AF","E-UNK","E-VT"), 16, replace = TRUE)
example <- data.frame(id1, id2, GENDER, ETH, stringsAsFactors = FALSE)

where there are two IDs and some duplicate entries for IDs that have
different GENDER or ETH(nicity).
I would like to get a data frame that doesn't have the duplicates, where the
GENDER that is kept is whichever one is not G-UNK (unknown) and the
ETH that is kept is whichever one is not E-UNK.

The resultant data frame should have 10 rows with no *-UNK in either of the
last two columns (unless both entries were UNK).

Yes, the example data may have some impossible results, but it does capture
the important aspects:
1) G-UNK is alphabetically last of G-F, G-M and G-UNK
2) E-UNK is in the middle alphabetically
3) sometimes the first entry is the unknown gender, sometimes it is the
second (*likely to happen with a random sample)
4) sometimes both entries for one variable, GENDER or ETH, are unknown
5) there appear to be only two of each row (*not 100% sure)
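
One possible base R approach, sketched on a small fixed data set instead of the random sample so the result is predictable: collapse each column per ID pair with a helper that returns the first known value, then merge the two results. The helper pick_known and the five-row example are illustrative, and the columns are assumed to be character, not factor.

```r
example <- data.frame(
  id1 = c(1, 1, 2, 2, 3),
  id2 = c(22, 22, 34, 34, 15),
  GENDER = c("G-UNK", "G-M", "G-F", "G-UNK", "G-UNK"),
  ETH = c("E-AF", "E-UNK", "E-UNK", "E-UNK", "E-VT"),
  stringsAsFactors = FALSE
)

# First known value in a group, or the unknown code if nothing is known.
pick_known <- function(x, unknown) {
  known <- x[x != unknown]
  if (length(known) > 0) known[1] else unknown
}

gender <- aggregate(GENDER ~ id1 + id2, data = example,
                    FUN = function(x) pick_known(x, "G-UNK"))
eth <- aggregate(ETH ~ id1 + id2, data = example,
                 FUN = function(x) pick_known(x, "E-UNK"))

# One row per ID pair, with the known value kept wherever one exists.
result <- merge(gender, eth, by = c("id1", "id2"))
result
```

Because the helper looks for the unknown code explicitly, it does not matter where *-UNK falls alphabetically or which of the duplicate rows comes first.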

Thanks!
 Robert



[R] string processing(regular expressions)

2013-09-01 Thread Robert Lynch
I have a variable that is course #
nCourse <-
as.factor(c("002A","002B","002C","007A","007B","007C","101","118A","118B","118C"))

I would like to get rid of the leading zeros, and have the following
set
("2A","2B","2C","7A","7B","7C","101","118A","118B","118C") to paste()
together with the department, B, P, C (bio, phys, chem, etc.)

I am stuck trying to figure out regular expressions; they are new to me.
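
A minimal sketch with sub(): the pattern ^0+ matches one or more zeros anchored at the start of the string, and replacing the match with an empty string removes them. The dept vector is a hypothetical department assignment added only to show the paste() step.

```r
nCourse <- as.factor(c("002A", "002B", "002C", "007A", "007B", "007C",
                       "101", "118A", "118B", "118C"))

# "^" anchors at the start of the string, "0+" is one or more zeros.
trimmed <- sub("^0+", "", as.character(nCourse))
trimmed

# Hypothetical department letters, one per course, pasted in front.
dept <- c("B", "B", "B", "P", "P", "P", "B", "C", "C", "C")
paste(dept, trimmed, sep = "")
```

Zeros inside a course number, such as the one in 101, are left alone because the pattern only matches at the start.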

Thank You very much



Re: [R] Legend formatting (ggplot2)

2013-08-28 Thread Robert Lynch
I am having trouble getting my legend to format the way I want it to.  I
suspect it is something simple.

The code I have is
library(ggplot2)
ggplot(Chem.comp, aes(Course, GRADE.)) +
  geom_boxplot(notch = TRUE, aes(fill = COHORT)) +
  labs(title = "Comparison between ISE cohorts and Peers in the Same Chem 2 class",
       y = "Grade Points in class",
       x = "Chemistry 2 quarter") +
  ylim(0, 4.) +
  guides(colour = guide_legend(nrow = 3)) +
  scale_fill_manual(name = "ISE Cohorts \nComparison groups",
                    values = c("blue", "red", "blue3", "red3", "blue4", "red4"))

which plots as attached.

I would like to have the legend as two columns, one blue (ISE07, ISE08,
ISE09) and one red (Comparison 07, Comparison 08, Comparison 09).

The guides(colour = guide_legend(nrow = 3)) is what I found at Stack Overflow
(http://stackoverflow.com/questions/12323416/arranging-ggplot2-legend-items-in-a-grid).
I am not quite sure how to parse the ggplot documentation page for myself,
but it looks the same:

http://docs.ggplot2.org/current/guide_legend.html

Ideally I'd like the legend to have a column of text labels (ISE...), a
column of blue boxes, a column of red boxes and a column of text labels
(Comp), but that is mostly just a bonus.

 Thanks
 Robert



1ColLedgend.pdf
Description: Adobe PDF document


[R] the inverse of assign()

2013-08-27 Thread Robert Lynch
I am looking for a way to extract the name of a variable that has been
passed into a function,

for example
foo <- function(x) {
  write.csv(x, file = paste(NAME(x), "csv", sep = "."))
}

Is there a function NAME that would let the call
foo(bar)
  write the file bar.csv
and foo(stuff)
  write the file stuff.csv?
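
The usual base R idiom for this is deparse(substitute(x)), which returns the expression the caller typed for the argument as a string. In this sketch the dir argument and the use of tempdir() are additions so the example does not write into the working directory.

```r
foo <- function(x, dir = tempdir()) {
  # substitute(x) is the unevaluated argument; deparse() turns it into text.
  name <- deparse(substitute(x))
  path <- file.path(dir, paste(name, "csv", sep = "."))
  write.csv(x, file = path, row.names = FALSE)
  invisible(path)
}

bar <- data.frame(a = 1:3)
out <- foo(bar)
basename(out)
```

This works when the argument is a plain variable name; a call such as foo(bar[1:2, ]) would produce a file name built from the whole expression.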

Robert



Re: [R] ave function

2013-08-20 Thread Robert Lynch
I tried
 lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x)
shapiro.test(x$GRADE))
 and I got
Error in shapiro.test(x$GRADE.) : sample size must be between 3 and 5000

I also tried
with(Clean, aggregate(GRADE,list(TERM,INST_NUM),FUN=shapiro.test))

and got
  Group.1 Group.2 x
1   201001  689809 0.9546164
2   201201  689809 0.9521624
3   201301  689809 0.9106206
4   200701  994474 0.8862705
5   200710  994474 0.9176743
6   201203 1105752 0.9382688
.
.
.
72  201001 1759272 0.9291295
73  201101 1759272 0.9347072
74  201110 1897809 0.9395375
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

I am not sure how to interpret the output of the second.
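
A sketch of what may explain both results, using the sample data from the original post. split() on two factors creates every TERM by INST_NUM combination, including ones with no rows, and shapiro.test() stops on an empty group; drop = TRUE removes those. shapiro.test() also returns a list-like htest object rather than a single number, which is likely why aggregate() built a malformed data frame; extracting $p.value gives one number per group.

```r
Clean <- data.frame(
  GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4),
  TERM = factor(c(9, 9, 9, 8, 8, 8, 9, 9, 9, 8, 8, 8)),
  INST_NUM = factor(c(1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1))
)

# drop = TRUE discards TERM/INST_NUM combinations that have no rows.
groups <- split(Clean$GRADE, list(Clean$TERM, Clean$INST_NUM), drop = TRUE)

# One p-value per course offering.
pvals <- sapply(groups, function(g) shapiro.test(g)$p.value)
pvals
```

The x column printed by aggregate() above appears to hold the W statistic of each test, not a p-value.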

Thanks!


On Tue, Aug 13, 2013 at 11:01 AM, arun smartpink...@yahoo.com wrote:

 Hi,
 You could try:
  lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x)
 shapiro.test(x$GRADE))
 A.K.






[R] ave function

2013-08-13 Thread Robert Lynch
I've written the following function
CoursePrep <- function (Source, SaveName) {


  Clean$TERM <- as.factor(Clean$TERM)

  Clean$INST_NUM <- as.factor(Clean$INST_NUM)
  Clean$zGrade <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN = scale))
  write.csv(Clean, paste(SaveName, "csv", sep = "."), row.names = FALSE)
  return(Clean)
}

which is all well and good, but I want to throw a shapiro.test in before I
normalize.  That is, I don't really understand quite how I did (I got help)
what I wanted to in the line
Clean$zGrade <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN = scale))
That code, for the whole of Clean, finds all sets of GRADE. values that have
the same INST_NUM and TERM, computes a mean, subtracts off the mean and
divides by the standard deviation.  For each one of those sets of grades I
would like to call shapiro.test() on the set, to see if it is normal *before*
I assume it is.

I know the naive
with(Clean, shapiro.test(list(TERM, INST_NUM)))
doesn't work.  I also tried
with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN =
function(x) shapiro.test(x)))

which returns
Error in shapiro.test(x) : sample size must be between 3 and 5000
even though I have checked that the sets selected are all of length between
3 and 5000, using the following on my full data

ClassSize <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN =
function(x) length(x)))
summary(ClassSize)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   22.0   198.0   241.0   244.4   279.0   466.0

here is some sample data
GRADE, TERM, INST_NUM
1,     9,    1
2,     9,    1
3,     9,    1
1.5,   8,    2
1.75,  8,    2
2,     8,    2
0.5,   9,    2
2,     9,    2
3.5,   9,    2
3.5,   8,    1
3.75,  8,    1
4,     8,    1

and hopefully the code would test the following set of grades
(1,2,3)(1.5,1.75,2)(0.5,2,3.5)(3.5,3.75,4)

Thanks Robert



[R] help with apply (lapply or sapply not sure)

2013-07-25 Thread Robert Lynch
I am reading in a bunch of files and then processing them all in the same
way.

I am sure there is a better way than to copy and paste the code for each
file. Here is what I've done so far

InputFiles <-
as.character(list.files("~/ISLE/RWork/DataWarehouseMining/byCourse/"))
# Path to the Course data files

for (i in InputFiles) {
  # print(head(read.csv(paste("~/ISLE/RWork/DataWarehouseMining/byCourse/", i, sep = ""))))
  print(paste("Reading file ~/ISLE/RWork/DataWarehouseMining/byCourse/", i,
              sep = ""))
  assign(i, read.csv(paste("~/ISLE/RWork/DataWarehouseMining/byCourse/", i,
                           sep = "")))
} # note the last file is NOT a course file but the student information.
Master <- StudentInfoForRobertWUnitAt7A_2.csv # this is the last file


CourseFiles <- InputFiles[-c(15, 16)] # ignore the student info ...7A.csv and
# ...7A_2.csv

# for each file I do the following
# Bis 101
summary(BigInstBIS101.csv)
B101 <- BigInstBIS101.csv[-c(3,4,8)]
summary(B101)
B101$WH_ID <- as.factor(B101$WH_ID)
B101$SID <- as.factor(B101$SID)
B101$TERM <- as.factor(B101$TERM)
B101$CRN <- as.factor(B101$CRN)
B101$CRN_TRM <- as.factor(B101$CRN_TRM)
B101$INST_NUM <- as.factor(B101$INST_NUM)
B101$zGrade <- with(B101, ave(GRADE., list(TERM, INST_NUM), FUN = scale))
write.csv(B101, "B101.csv", row.names = FALSE)

# Bis 2A
B2A <- BigInstBIS2A.csv[-c(3,4,8)]
summary(B2A)
B2A$WH_ID <- as.factor(B2A$WH_ID)
B2A$SID <- as.factor(B2A$SID)
B2A$TERM <- as.factor(B2A$TERM)
B2A$CRN <- as.factor(B2A$CRN)
B2A$CRN_TRM <- as.factor(B2A$CRN_TRM)
B2A$INST_NUM <- as.factor(B2A$INST_NUM)
B2A$zGrade <- with(B2A, ave(GRADE., list(TERM, INST_NUM), FUN = scale))
write.csv(B2A, "B2A.csv", row.names = FALSE)


And so on for another 12 courses, however I am changing what I am doing as
part of the reading in the file and don't want to replace the code in 14
different places.
suggestions?  Thanks!
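
One way to avoid the repetition, sketched under simplifying assumptions: put the per-file steps into a single function and lapply() it over the file names, so a change is made in one place. The function prep_course, the directory demo_dir and the reduced column set (SID, TERM, INST_NUM, GRADE) are all hypothetical stand-ins for the real files.

```r
# All per-course processing lives in one function.
prep_course <- function(path, factor_cols = c("SID", "TERM", "INST_NUM")) {
  dat <- read.csv(path)
  dat[factor_cols] <- lapply(dat[factor_cols], as.factor)
  # z-score within each TERM x INST_NUM offering
  dat$zGrade <- ave(dat$GRADE, dat$TERM, dat$INST_NUM,
                    FUN = function(g) (g - mean(g)) / sd(g))
  dat
}

# Two small stand-in files so the sketch runs on its own.
demo_dir <- file.path(tempdir(), "byCourse_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (course in c("B101", "B2A")) {
  write.csv(data.frame(SID = 1:6, TERM = rep(c(200901, 200903), each = 3),
                       INST_NUM = 1, GRADE = c(1, 2, 3, 2, 3, 4)),
            file.path(demo_dir, paste0(course, ".csv")), row.names = FALSE)
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
courses <- lapply(files, prep_course)
names(courses) <- sub("\\.csv$", "", basename(files))
str(courses, max.level = 1)
```

Keeping the results in a named list (courses$B101, courses$B2A) also avoids creating one global variable per file with assign().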



[R] weighted average

2013-07-22 Thread Robert Lynch
I am trying to compute GPA from class grades (which have been normalized).
I have, for example, the following matrix

Master =
SID   B2A   B2B   B2C   C2A   C2B   C2C  C118A  C118B  C118C
001  0.01   0.5  -0.4   1.2  -1.8   0.3   -0.3    0.4    0.5
002  0.01   0.5  -0.4   0.5  -0.4   1.2   -1.8    0.3   -0.3
003  0.04  0.05   0.5  -0.4  -0.5   0.4   -1.2    1.8    0.3
etc

where each column has a zero mean and a standard deviation of 1.  I want to
calculate a weighted average for each row (student ID) that takes into
account that
B2A, C118A, C118B, and C118C are all 4 unit classes, and the rest, B2B,
B2C, C2A, C2B, C2C, are 5 unit classes.

I have tried
Units <- c(4,5,5,5,5,5,4,4,4)
Master$zGPA <- weighted.mean(Master[,2:10], Units)

But that gets me one number and not a vector.
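
A minimal sketch of a row-wise version: apply() with MARGIN = 1 calls weighted.mean() once per row, passing the unit weights through the w argument. The three rows are the example values from the table in this post.

```r
Master <- data.frame(
  SID = c("001", "002", "003"),
  B2A = c(0.01, 0.01, 0.04), B2B = c(0.5, 0.5, 0.05), B2C = c(-0.4, -0.4, 0.5),
  C2A = c(1.2, 0.5, -0.4), C2B = c(-1.8, -0.4, -0.5), C2C = c(0.3, 1.2, 0.4),
  C118A = c(-0.3, -1.8, -1.2), C118B = c(0.4, 0.3, 1.8), C118C = c(0.5, -0.3, 0.3)
)
Units <- c(4, 5, 5, 5, 5, 5, 4, 4, 4)

# One weighted mean per student (row), weights are the course units.
Master$zGPA <- apply(Master[, 2:10], 1, weighted.mean, w = Units)
Master[, c("SID", "zGPA")]
```

If some students have not taken every class, adding na.rm = TRUE to the apply() call drops the missing grades from that student's average.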



[R] interpreting GLM results and plotting fits

2013-06-29 Thread Robert Lynch
I am trying to interpret the output of GLM and I am not sure how it is
treating the factor GENDER with levels G-M and G-F.

Below is the output of summary(GPA.lm)


Call:
glm(formula = zGPA ~ Units.Bfr.7A * GENDER, data = Master1)

Deviance Residuals:
Min   1Q   Median   3Q  Max
-1.1432  -0.3285  -0.1061   0.2283   1.8286

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)            -2.513e-01  2.238e-02 -11.230  < 2e-16 ***
Units.Bfr.7A            2.297e-05  2.851e-04   0.081    0.936
GENDERG-M               3.183e-01  4.536e-02   7.018 2.56e-12 ***
Units.Bfr.7A:GENDERG-M -3.073e-03  5.975e-04  -5.142 2.82e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.2432662)

Null deviance: 1204.2  on 4875  degrees of freedom
Residual deviance: 1185.2  on 4872  degrees of freedom
  (106 observations deleted due to missingness)
AIC: 6950.8

Number of Fisher Scoring iterations: 2



Second, I would like to draw two lines with confidence intervals on the
scatter plot, one for G-M and the other for G-F.

I think I am doing this with
stat_smooth(aes(group = GENDER), method = "glm", fullrange = TRUE)
but again I am not sure quite what is being output.
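
A worked reading of the coefficient table, assuming R's default treatment contrasts, under which the first level (G-F) is the reference and the GENDERG-M terms are differences from it. The intercept and slope for G-F are then the (Intercept) and Units.Bfr.7A rows, and the G-M line adds the two GENDERG-M rows to them. The names coefs and fit_lines are illustrative.

```r
# Estimates copied from the summary output above.
coefs <- c(intercept = -2.513e-01, units = 2.297e-05,
           male = 3.183e-01, units_male = -3.073e-03)

# One fitted line per gender: zGPA = intercept + slope * Units.Bfr.7A
fit_lines <- data.frame(
  GENDER = c("G-F", "G-M"),
  intercept = unname(c(coefs["intercept"], coefs["intercept"] + coefs["male"])),
  slope = unname(c(coefs["units"], coefs["units"] + coefs["units_male"]))
)
fit_lines
```

So the G-F line is essentially flat, while the G-M line starts about 0.32 higher and falls by about 0.003 per unit.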



[R] data selection

2013-06-18 Thread Robert Lynch
I have two different kinds of data frames (actually a set of data frames, one
for each class, and one master frame into which I want to pull some data from
each of the frames in the set).
Each frame in the set holds all students that have taken a course,

so the set of data frames is
B101
B2A
B2B
B2C
etc.


and each one has lots of data, e.g.
B101
SID zGRADE
444 -.2
458  0
587  .2
etc

and
Master
SID
587
etc

I would like to make a field in Master for each data frame,
e.g. Master$B101, Master$B2A,
and populate it with the zGRADEs from each of the data frames.

The SIDs in Master are a wholly contained subset of the ones in each of
the other data frames.
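
A minimal sketch using match(), which returns, for each SID in Master, the row position of that SID in a course frame; indexing zGRADE with those positions lines the grades up with Master. The B101 rows come from the example above, while the B2A rows and the second Master SID are made up.

```r
Master <- data.frame(SID = c(587, 444))
B101 <- data.frame(SID = c(444, 458, 587), zGRADE = c(-0.2, 0, 0.2))
B2A  <- data.frame(SID = c(587, 444, 900), zGRADE = c(1.1, -0.5, 0.3))  # made up

# Keep the course frames in a named list so one loop handles all of them.
courses <- list(B101 = B101, B2A = B2A)

for (nm in names(courses)) {
  rows <- match(Master$SID, courses[[nm]]$SID)
  Master[[nm]] <- courses[[nm]]$zGRADE[rows]
}
Master
```

An SID that is missing from a course frame would get NA in that column.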



[R] Multiple selection and normalization

2013-06-03 Thread Robert Lynch
Hi--

I am trying to normalize course grades for each instance of a course, e.g.
Stats 1 Fall2009 J. Smith.

I have a frame for all instances of a course, e.g. stats 1 in the last 5
years, that looks like

SIDN  TERM  GRADE  INST

where SIDN is a Student ID Number, TERM is a factor that gives the quarter
and year a course was offered, GRADE is a 0-4.3 grade and INST is the
instructor, again as a factor.

Course offerings are determined by the TERM and INST.  That is one inst.
assigned grades to all the students they were responsible for that term.
 Multiple instructors may have taught the same term.

For every course offering I would like to normalize the GRADE:
Z <- (GRADE - mean)/SD, where the mean and SD are over a single course offering.
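
One possible way to do this in base R is ave(), which applies a function within each combination of the grouping factors and returns the results in the original row order. The data frame below is made up to match the column layout described above, and z_score is an illustrative helper.

```r
grades <- data.frame(
  SIDN = 1:12,
  TERM = factor(rep(c("F2009", "W2010"), each = 6)),
  INST = factor(rep(c("Smith", "Jones", "Smith", "Jones"), each = 3)),
  GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4)
)

# Standardize a vector of grades from one course offering.
z_score <- function(g) (g - mean(g)) / sd(g)

# ave() applies z_score separately within each TERM x INST combination.
grades$Z <- ave(grades$GRADE, grades$TERM, grades$INST, FUN = z_score)
grades
```

Offerings with a single student would get NA, since the standard deviation of one grade is undefined.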

Thanks!
RBL
