Hi David, I just used the caret library and the farff file format.

sapply(d, function(x){ length(unique(x)) } )
            PLUGIN_RULE_KEY           PLUGIN_CONFIG_KEY                 PLUGIN_NAME
                       1817                        1211                          14
                DESCRIPTION                    SEVERITY                        NAME
                       1815                           5                        1813
   DEF_REMEDIATION_FUNCTION        REMEDIATION_GAP_MULT DEF_REMEDIATION_BASE_EFFORT
                          4                           1                          15
            GAP_DESCRIPTION                 SYSTEM_TAGS                 IS_TEMPLATE
                         15                         142                           2
         DESCRIPTION_FORMAT                        TYPE
                          2                           3

On Thu, Apr 14, 2022 at 2:30 AM David Winsemius <dwinsem...@comcast.net> wrote:

>
> On 4/13/22 16:58, Neha gupta wrote:
> > summary(d)
> >  PLUGIN_RULE_KEY    PLUGIN_CONFIG_KEY  PLUGIN_NAME        DESCRIPTION
> >  Length:1819        Length:1819        Length:1819        Length:1819
> >  Class :character   Class :character   Class :character   Class :character
> >  Mode  :character   Mode  :character   Mode  :character   Mode  :character
> >
> >  SEVERITY           NAME               REMEDIATION_FUNCTION
> >  Length:1819        Length:1819        Mode:logical
> >  Class :character   Class :character   NA's:1819
> >  Mode  :character   Mode  :character
>
> According to your code that column (and the two following it) should have
> been deleted with the lines:
>
> d$REMEDIATION_FUNCTION=NULL
> d$DEF_REMEDIATION_GAP_MULT=NULL
> d$REMEDIATION_BASE_EFFORT=NULL
>
> # And we're still only guessing that the caret package has been loaded since
> you have not offered the library calls that have set your workspace.
>
> Now we need:
> sapply(d, function(x){ length(unique(x)) } )
>
> # I generally use Hmisc::describe since it is a vast improvement over the
> default summary.data.frame
>
> >  DEF_REMEDIATION_FUNCTION  REMEDIATION_GAP_MULT  DEF_REMEDIATION_GAP_MULT
> >  Length:1819               Mode:logical          Length:1819
> >  Class :character          NA's:1819             Class :character
> >  Mode  :character                                Mode  :character
> >
> >  REMEDIATION_BASE_EFFORT  DEF_REMEDIATION_BASE_EFFORT  GAP_DESCRIPTION
> >  Mode:logical             Length:1819                  Length:1819
> >  NA's:1819                Class :character             Class :character
> >                           Mode  :character             Mode  :character
>
> So that column should also have been deleted since it is all NA.
>
> >  SYSTEM_TAGS        IS_TEMPLATE        DESCRIPTION_FORMAT  TYPE
> >  Length:1819        Min.   :0.00000    Length:1819         Length:1819
> >  Class :character   1st Qu.:0.00000    Class :character    Class :character
> >  Mode  :character   Median :0.00000    Mode  :character    Mode  :character
> >                     Mean   :0.02859
> >                     3rd Qu.:0.00000
> >                     Max.   :1.00000
>
> The IS_TEMPLATE column certainly does not look like "text".
>
> You have been posting in HTML. STOP DOING THAT. Read the Posting Guide
>
> --
> David.
>
> On Thu, Apr 14, 2022 at 1:33 AM David Winsemius <dwinsem...@comcast.net> wrote:
>
>>
>> On 4/13/22 13:07, Neha gupta wrote:
>> > Thank you Tim
>> >
>> > My purpose and aim is to train a model
>>
>> It appears you are using the 'caret' package. You should have posted
>> code that loaded all the packages that you thought were essential to the
>> effort.
>>
>> > (based on the data I provided in my
>> > first email) and predict the output variable (TYPE variable which has three
>> > different values like Severity, bugs, code smell etc). The data, as you can
>> > see, also has a few text columns, which are creating problems for me as I
>> > have never worked with text data before.
>>
>> You offered what looks something like the output of `str(.)` on a
>> dataframe. It would also be a great help if you offered the output of
>> `summary(d)`
>>
>> > I read a tutorial
>>
>> What tutorial?
>>
>> > that says stringsAsFactors should be FALSE, symbols etc
>> > should be removed, then data should be in tokens, then the text should be
>> > placed in a matrix/table form, and then a model should be built. I am not
>> > sure if these steps are required in my case. I think it's a word2vec problem,
>> > though I have not used it before.
>>
>> I'm not sure what a "word2vec problem" might be.
>>
>> --
>>
>> David.
>>
>> >
>> > Best regards
>> >
>> > On Wed, Apr 13, 2022 at 9:53 PM Ebert,Timothy Aaron <teb...@ufl.edu> wrote:
>> >
>> >> Is this a different question from the original post? It would be better to
>> >> keep threads separate.
>> >>
>> >> Always pre-process the data.
>> >> Clean the data of obvious mistakes. This can
>> >> be simple typographical errors or complicated, like an author that wrote
>> >> "too" when they intended "two" or "to". In old English texts spelling was not
>> >> standardized and the same word could have multiple spellings within one
>> >> book or chapter. Removing punctuation is probably a part of this, though a
>> >> program like Grammarly would not work very well if it removed punctuation.
>> >>
>> >> After that it depends on what you are trying to accomplish. Are you
>> >> interested in the number of times an author used the word “a” or “the” and
>> >> is “The” different from “the”? Are you modeling word use frequency or
>> >> comparing vocabulary between texts?
>> >>
>> >> Too many choices.
>> >>
>> >> Tim
>> >>
>> >> *From:* Neha gupta <neha.bologn...@gmail.com>
>> >> *Sent:* Wednesday, April 13, 2022 2:49 PM
>> >> *To:* Bill Dunlap <williamwdun...@gmail.com>
>> >> *Cc:* Ebert,Timothy Aaron <teb...@ufl.edu>; r-help mailing list <r-help@r-project.org>
>> >> *Subject:* Re: Error with text analysis data
>> >>
>> >> *[External Email]*
>> >>
>> >> Someone just told me that you need to pre-process the data before model
>> >> construction. For instance, convert the text to lower case, remove
>> >> punctuation, symbols etc and tokenize the text (give a number to each word).
>> >> Then create a bag-of-words model (not sure about it), and then create a
>> >> model.
>> >>
>> >> Is it necessary to perform all these steps?
>> >>
>> >> Best regards
>> >>
>> >> On Wednesday, April 13, 2022, Bill Dunlap <williamwdun...@gmail.com> wrote:
>> >>
>> >>> I would always suggest working until the model works, no errors and no
>> >>> NA values
>> >>
>> >> We agree on that. However, the error gives you no hint about which
>> >> variables are causing the problem.
>> >> If it did, then it could only tell
>> >> about the first variable with the problem. I think you would get to your
>> >> working model faster if you got NA's for the constant columns and then
>> >> could drop them all at once (or otherwise deal with them).
>> >>
>> >> -Bill
>> >>
>> >> On Wed, Apr 13, 2022 at 9:40 AM Ebert,Timothy Aaron <teb...@ufl.edu> wrote:
>> >>
>> >> I suspect that it is because you are looking at two types of error, both
>> >> telling you that the model was not appropriate. In the “error in contrasts”
>> >> there is nothing to contrast in the model. For a numerical constant the
>> >> program calculates the standard deviation and ends with a division by zero.
>> >> Division by zero is undefined, or NA.
>> >>
>> >> I would always suggest working until the model works, no errors and no NA
>> >> values. The reason is that I can get NA in several ways and I need to
>> >> understand why. If I just ignore the NA in my model I may be assuming the
>> >> wrong thing.
>> >>
>> >> Tim
>> >>
>> >> *From:* Bill Dunlap <williamwdun...@gmail.com>
>> >> *Sent:* Wednesday, April 13, 2022 12:23 PM
>> >> *To:* Ebert,Timothy Aaron <teb...@ufl.edu>
>> >> *Cc:* Neha gupta <neha.bologn...@gmail.com>; r-help mailing list <r-help@r-project.org>
>> >> *Subject:* Re: [R] Error with text analysis data
>> >>
>> >> Constant columns can be in the model when you do some subsetting or are
>> >> exploring a new dataset. My objection is that constant columns of numbers
>> >> and logicals are fine but those of characters and factors are not.
>> >>
>> >> -Bill
>> >>
>> >> On Wed, Apr 13, 2022 at 9:15 AM Ebert,Timothy Aaron <teb...@ufl.edu> wrote:
>> >>
>> >> What is the goal of having a constant in the model? To me that seems
>> >> pointless.
>> >> Also there is no variability in sexCode regardless of whether
>> >> you call it integer or factor. So the model y ~ sexCode is just a strange
>> >> way to look at the variability in y and it would be better to do something
>> >> like summarize(y) or mean(y) if that was the goal.
>> >>
>> >> Tim
>> >>
>> >> -----Original Message-----
>> >> From: R-help <r-help-boun...@r-project.org> On Behalf Of Bill Dunlap
>> >> Sent: Wednesday, April 13, 2022 9:56 AM
>> >> To: Neha gupta <neha.bologn...@gmail.com>
>> >> Cc: r-help mailing list <r-help@r-project.org>
>> >> Subject: Re: [R] Error with text analysis data
>> >>
>> >> This sounds like what I think is a bug in stats::model.matrix.default(): a
>> >> numeric column with all identical entries is fine but a constant character
>> >> or factor column is not.
>> >>
>> >> > d <- data.frame(y=1:5, sex=rep("Female",5))
>> >> > d$sexFactor <- factor(d$sex, levels=c("Male","Female"))
>> >> > d$sexCode <- as.integer(d$sexFactor)
>> >> > d
>> >>   y    sex sexFactor sexCode
>> >> 1 1 Female    Female       2
>> >> 2 2 Female    Female       2
>> >> 3 3 Female    Female       2
>> >> 4 4 Female    Female       2
>> >> 5 5 Female    Female       2
>> >> > lm(y~sex, data=d)
>> >> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>> >>   contrasts can be applied only to factors with 2 or more levels
>> >> > lm(y~sexFactor, data=d)
>> >> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>> >>   contrasts can be applied only to factors with 2 or more levels
>> >> > lm(y~sexCode, data=d)
>> >>
>> >> Call:
>> >> lm(formula = y ~ sexCode, data = d)
>> >>
>> >> Coefficients:
>> >> (Intercept)      sexCode
>> >>           3           NA
>> >>
>> >> Calling traceback() after the error would clarify this.
>> >>
>> >> -Bill
>> >>
>> >> On Tue, Apr 12, 2022 at 3:12 PM Neha gupta <neha.bologn...@gmail.com> wrote:
>> >>
>> >>> Hello everyone, I have text data whose output variable has three subgroups.
>> >>> I am using the following code but getting an error message (see the error
>> >>> after the code).
>> >>>
>> >>> d=read.csv("SONAR_RULES.csv", stringsAsFactors = FALSE)
>> >>> d$REMEDIATION_FUNCTION=NULL
>> >>> d$DEF_REMEDIATION_GAP_MULT=NULL
>> >>> d$REMEDIATION_BASE_EFFORT=NULL
>> >>>
>> >>> index <- createDataPartition(d$TYPE, p = .70,list = FALSE)
>> >>> tr <- d[index, ]
>> >>> ts <- d[-index, ]
>> >>>
>> >>> ctrl <- trainControl(method = "cv",number=3, index = index,
>> >>>                      classProbs = TRUE, summaryFunction = multiClassSummary)
>> >>>
>> >>> ran <- train(TYPE ~ ., data = tr,
>> >>>              method = "rpart",
>> >>>              ## Will create 48 parameter combinations
>> >>>              tuneLength = 3,
>> >>>              na.action= na.pass,
>> >>>              metric = "Accuracy",
>> >>>              preProc = c("center", "scale", "nzv"),
>> >>>              trControl = ctrl)
>> >>> getTrainPerf(ran)
>> >>>
>> >>> It gives me this error:
>> >>>
>> >>> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>> >>>   contrasts can be applied only to factors with 2 or more levels
>> >>>
>> >>> My data are as follows:
>> >>>
>> >>> Rows: 1,819
>> >>> Columns: 14
>> >>> $ PLUGIN_RULE_KEY             <chr> "InsufficientBranchCoverage", "InsufficientLin~
>> >>> $ PLUGIN_CONFIG_KEY           <chr> "", "", "", "", "", "", "", "", "", "", "S1120~
>> >>> $ PLUGIN_NAME                 <chr> "common-java", "common-java", "common-java", "~
>> >>> $ DESCRIPTION                 <chr> "An issue is created on a file as soon as the ~
>> >>> $ SEVERITY                    <chr> "MAJOR", "MAJOR", "MAJOR", "MAJOR", "MAJOR", "~
>> >>> $ NAME                        <chr> "Branches should have sufficient coverage by t~
>> >>> $ DEF_REMEDIATION_FUNCTION    <chr> "LINEAR", "LINEAR", "LINEAR", "LINEAR_OFFSET",~
>> >>> $ REMEDIATION_GAP_MULT        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
>> >>> $ DEF_REMEDIATION_BASE_EFFORT <chr> "", "", "", "10min", "", "", "5min", "5min", "~
>> >>> $ GAP_DESCRIPTION             <chr> "number of uncovered conditions", "number of l~
>> >>> $ SYSTEM_TAGS                 <chr> "bad-practice", "bad-practice", "convention", ~
>> >>> $ IS_TEMPLATE                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
>> >>> $ DESCRIPTION_FORMAT          <chr> "HTML", "HTML", "HTML", "HTML", "HTML", "HTML"~
>> >>> $ TYPE                        <chr> "CODE_SMELL", "CODE_SMELL", "CODE_SMELL", "COD~
>> >>>
>> >>> [[alternative HTML version deleted]]
>> >>>
>> >>> ______________________________________________
>> >>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>> PLEASE do read the posting guide
>> >>> http://www.R-project.org/posting-guide.html
>> >>> and provide commented, minimal, self-contained, reproducible code.
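
A minimal sketch of the cleanup suggested in the thread: count the distinct values in each column, then drop the columns that are constant or entirely NA before fitting a formula-based model. This is not code from any of the posters. The toy data frame and the names n_distinct, drop_cols and d_clean are made up for illustration, and only base R is assumed (no caret).

```r
# Toy data in the spirit of Bill Dunlap's example: 'sex' is a constant
# character column, 'allNA' mimics a column such as REMEDIATION_GAP_MULT
# where every value is missing. All names here are illustrative only.
d <- data.frame(
  y     = 1:5,
  x     = c(2.1, 3.4, 1.9, 5.0, 4.2),
  sex   = rep("Female", 5),
  allNA = rep(NA, 5),
  stringsAsFactors = FALSE
)

# Count distinct non-missing values per column.
n_distinct <- sapply(d, function(col) length(unique(col[!is.na(col)])))

# Columns with fewer than two distinct values carry no information for a
# model; character/factor columns of this kind trigger the contrasts error.
drop_cols <- names(n_distinct)[n_distinct < 2]
print(drop_cols)

# Drop them all at once and fit on what is left.
d_clean <- d[, !(names(d) %in% drop_cols), drop = FALSE]
fit <- lm(y ~ ., data = d_clean)
print(coef(fit))
```

The same check could be run on the real data frame before calling train(). Whether the remaining free-text columns (DESCRIPTION, NAME and so on) belong in the formula at all is a separate question that this sketch does not address.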