Re: [R] Error with text analysis data
Is this a different question from the original post? It would be better to keep threads separate.

Always pre-process the data. Clean the data of obvious mistakes. These can be simple typographical errors or something more complicated, like an author who wrote “too” when they intended “two” or “to”. In old English texts spelling was not standardized, and the same word could have multiple spellings within one book or chapter. Removing punctuation is probably a part of this, though a program like Grammarly would not work very well if it removed punctuation.

After that it depends on what you are trying to accomplish. Are you interested in the number of times an author used the word “a” or “the”, and is “The” different from “the”? Are you modeling word use frequency or comparing vocabulary between texts?

Too many choices.

Tim

From: Neha gupta
Sent: Wednesday, April 13, 2022 2:49 PM
To: Bill Dunlap
Cc: Ebert,Timothy Aaron; r-help mailing list
Subject: Re: Error with text analysis data

Someone just told me that you need to pre-process the data before model construction: for instance, convert the text to lower case, remove punctuation, symbols, etc., and tokenize the text (give a number to each word). Then create a bag-of-words model (not sure about it), and then create a model.

Is it necessary to perform all these steps?

Best regards
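The preprocessing steps mentioned in this message (lower-casing, removing punctuation, tokenizing, building a bag of words) can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: the two example sentences in `docs` are made up and are not taken from the poster's data, and a real analysis would probably use a dedicated text-mining package instead.

```r
# Hypothetical example documents (not from SONAR_RULES.csv).
docs <- c("The cat sat on the mat.", "A dog chased the cat!")

# 1. Lower-case the text, 2. replace punctuation with spaces,
# 3. split each document into word tokens on whitespace.
clean  <- gsub("[[:punct:]]+", " ", tolower(docs))
tokens <- strsplit(trimws(clean), "[[:space:]]+")

# 4. Bag of words: one row per document, one column per vocabulary term,
#    each cell holding the number of times the term occurs.
vocab <- sort(unique(unlist(tokens)))
bow <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
colnames(bow) <- vocab
bow
```

Whether “The” and “the” should be merged, as done here by `tolower()`, depends on the question being asked; dropping that call keeps them as separate terms.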
Re: [R] Error with text analysis data
> I would always suggest working until the model works, no errors and no NA values

We agree on that. However, the error gives you no hint about which variables are causing the problem. Even if it did, it could only tell you about the first variable with the problem. I think you would get to your working model faster if you got NA's for the constant columns and then could drop them all at once (or otherwise deal with them).

-Bill

On Wed, Apr 13, 2022 at 9:40 AM Ebert,Timothy Aaron wrote:

> I suspect that it is because you are looking at two types of error, both
> telling you that the model was not appropriate. In the “error in contrasts”
> there is nothing to contrast in the model. For a numerical constant the
> program calculates the standard deviation and ends with a division by zero.
> Division by zero is undefined, or NA.
>
> I would always suggest working until the model works, no errors and no NA
> values. The reason is that I can get NA in several ways and I need to
> understand why. If I just ignore the NA in my model I may be assuming the
> wrong thing.
>
> Tim
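Since the contrasts error does not name the offending variables, one option is to look for constant columns before fitting. The sketch below is a minimal base-R illustration; `constant_columns` is a hypothetical helper written for this example (it is not part of base R or caret), and the small data frame is made up.

```r
# Return the names of columns with fewer than two distinct non-missing values.
# Hypothetical helper for illustration only.
constant_columns <- function(df) {
  n_distinct <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))
  names(df)[n_distinct < 2L]
}

# Made-up data: `sex` is constant, `allNA` has no observed values at all.
d <- data.frame(y = 1:5,
                sex = rep("Female", 5),
                allNA = NA,
                x = c(2, 4, 6, 8, 11))

bad <- constant_columns(d)
bad

# Drop all offending columns at once, then fit.
d2 <- d[, setdiff(names(d), bad), drop = FALSE]
fit <- lm(y ~ ., data = d2)
```

Counting only non-missing values means an all-`NA` column is flagged too, which may or may not be what you want for your data.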
Re: [R] Error with text analysis data
Constant columns can be in the model when you do some subsetting or are exploring a new dataset. My objection is that constant columns of numbers and logicals are fine but those of characters and factors are not.

-Bill

On Wed, Apr 13, 2022 at 9:15 AM Ebert,Timothy Aaron wrote:

> What is the goal of having a constant in the model? To me that seems
> pointless. Also there is no variability in sexCode regardless of whether
> you call it integer or factor. So the model y ~ sexCode is just a strange
> way to look at the variability in y and it would be better to do something
> like summarize(y) or mean(y) if that was the goal.
>
> Tim
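The point that `y ~ sexCode` with a constant `sexCode` says nothing beyond the mean of `y` can be checked directly. The sketch below reuses the toy values from this thread (`y = 1:5`, `sexCode` always 2); the object names `fit_const` and `fit_mean` are introduced only for this illustration. As far as I understand it, `lm()` reports `NA` for the constant numeric column because that column is collinear with the intercept and so cannot be estimated separately.

```r
d <- data.frame(y = 1:5, sexCode = rep(2L, 5))

# Constant numeric predictor: the intercept is estimated, sexCode is NA.
fit_const <- lm(y ~ sexCode, data = d)
coef(fit_const)

# Intercept-only model: the single coefficient is the mean of y.
fit_mean <- lm(y ~ 1, data = d)
coef(fit_mean)
mean(d$y)
```

Both fits give the same intercept, so the constant column adds nothing to the model.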
Re: [R] Error with text analysis data
This sounds like what I think is a bug in stats::model.matrix.default(): a numeric column with all identical entries is fine but a constant character or factor column is not.

> d <- data.frame(y=1:5, sex=rep("Female",5))
> d$sexFactor <- factor(d$sex, levels=c("Male","Female"))
> d$sexCode <- as.integer(d$sexFactor)
> d
  y    sex sexFactor sexCode
1 1 Female    Female       2
2 2 Female    Female       2
3 3 Female    Female       2
4 4 Female    Female       2
5 5 Female    Female       2
> lm(y~sex, data=d)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
> lm(y~sexFactor, data=d)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
> lm(y~sexCode, data=d)

Call:
lm(formula = y ~ sexCode, data = d)

Coefficients:
(Intercept)      sexCode
          3           NA

Calling traceback() after the error would clarify this.

-Bill

On Tue, Apr 12, 2022 at 3:12 PM Neha gupta wrote:

> Hello everyone, I have text data with output variable have three subgroups.
> I am using the following code but getting the error message (see error
> after the code).
>
> d=read.csv("SONAR_RULES.csv", stringsAsFactors = FALSE)
> d$REMEDIATION_FUNCTION=NULL
> d$DEF_REMEDIATION_GAP_MULT=NULL
> d$REMEDIATION_BASE_EFFORT=NULL
>
> index <- createDataPartition(d$TYPE, p = .70,list = FALSE)
> tr <- d[index, ]
> ts <- d[-index, ]
>
> ctrl <- trainControl(method = "cv",number=3, index = index, classProbs =
> TRUE, summaryFunction = multiClassSummary)
>
> ran <- train(TYPE ~ ., data = tr,
>              method = "rpart",
>              ## Will create 48 parameter combinations
>              tuneLength = 3,
>              na.action= na.pass,
>              metric = "Accuracy",
>              preProc = c("center", "scale", "nzv"),
>              trControl = ctrl)
> getTrainPerf(ran)
>
> *It gives me error:*
>
> *Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
> contrasts can be applied only to factors with 2 or more levels*
>
> *My data is as follow*
>
> Rows: 1,819
> Columns: 14
> $ PLUGIN_RULE_KEY             "InsufficientBranchCoverage", "InsufficientLin~
> $ PLUGIN_CONFIG_KEY           "", "", "", "", "", "", "", "", "", "", "S1120~
> $ PLUGIN_NAME                 "common-java", "common-java", "common-java", "~
> $ DESCRIPTION                 "An issue is created on a file as soon as the ~
> $ SEVERITY                    "MAJOR", "MAJOR", "MAJOR", "MAJOR", "MAJOR", "~
> $ NAME                        "Branches should have sufficient coverage by t~
> $ DEF_REMEDIATION_FUNCTION    "LINEAR", "LINEAR", "LINEAR", "LINEAR_OFFSET",~
> $ REMEDIATION_GAP_MULT        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
> $ DEF_REMEDIATION_BASE_EFFORT "", "", "", "10min", "", "", "5min", "5min", "~
> $ GAP_DESCRIPTION             "number of uncovered conditions", "number of l~
> $ SYSTEM_TAGS                 "bad-practice", "bad-practice", "convention", ~
> $ IS_TEMPLATE                 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
> $ DESCRIPTION_FORMAT          "HTML", "HTML", "HTML", "HTML", "HTML", "HTML"~
> $ TYPE                        "CODE_SMELL", "CODE_SMELL", "CODE_SMELL", "COD~

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Error with text analysis data
Hi Neha,

The error message is about not having _factors_ with two or more levels. Apart from using stringsAsFactors=FALSE (meaning that you probably won't get any factors in "d"), your sample data doesn't look like CSV format. Perhaps the lines have been truncated. You may get something with stringsAsFactors=TRUE, but I don't know whether it will be more sensible.

Jim
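On the stringsAsFactors point, it may help to see how the model-matrix code treats character columns. The sketch below is a small base-R check on made-up data (the data frames `two` and `one` are hypothetical): a character predictor is converted to a factor when the model matrix is built, so the "2 or more levels" error can appear even when the data frame itself contains no factors.

```r
# Character column with two distinct values: expanded to a dummy variable.
two <- data.frame(y = 1:4,
                  sex = c("Female", "Male", "Female", "Male"),
                  stringsAsFactors = FALSE)
mm <- model.matrix(y ~ sex, data = two)
colnames(mm)

# Character column with a single value: same contrasts error as for a factor.
one <- data.frame(y = 1:4,
                  sex = rep("Female", 4),
                  stringsAsFactors = FALSE)
msg <- tryCatch({
  model.matrix(y ~ sex, data = one)
  "no error"
}, error = function(e) conditionMessage(e))
msg
```

So switching stringsAsFactors on or off is unlikely to remove the error by itself; the single-valued columns still have to be found and dealt with.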