I like option 1. Option 2 may cause problems if you pool groups that do
not belong together. This is especially a problem if you know that the data
are missing some groups. I would consider dropping rare groups, or comparing
results between the pooling and dropping options. If the answer is the same
in both cases, use the approach that makes your life easier with
reviewers/clients. If the answer is different, I would go with dropping rare
categories, or present both and highlight the difference in outcome. A third
option is to gather more data.
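
A rough sketch of the pool-versus-drop comparison in base R. Everything here is invented for illustration: the data frame `d`, the simulated response, and the cutoff `min_n`. What counts as "rare" is a judgment call, so treat the cutoff as an assumption rather than a recommendation.

```r
set.seed(1)
# Hypothetical data: a factor with two common and two rare levels
d <- data.frame(
  f = factor(rep(c("a", "b", "c", "d"), times = c(40, 40, 3, 2))),
  y = rbinom(85, 1, 0.5)
)

min_n  <- 5                          # assumed cutoff for "rare"
counts <- table(d$f)
rare   <- names(counts)[counts < min_n]

# Option A: pool the rare levels into a single "other" level
pooled <- d
lev <- as.character(pooled$f)
lev[lev %in% rare] <- "other"
pooled$f <- factor(lev)

# Option B: drop the rows that belong to rare levels
dropped <- droplevels(d[!(d$f %in% rare), ])

fit_pooled  <- glm(y ~ f, data = pooled,  family = binomial)
fit_dropped <- glm(y ~ f, data = dropped, family = binomial)

# Compare the estimates for the level both fits share
coef(fit_pooled)
coef(fit_dropped)
```

With more predictors in the model the two fits can disagree, which is the situation where presenting both results is worth the effort.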

Tim

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Bert Gunter
Sent: Sunday, November 20, 2022 1:06 PM
To: Mitchell Maltenfort <mmal...@gmail.com>
Cc: R-help <R-help@r-project.org>
Subject: Re: [R] test logistic regression model

I think (2) might be a bad idea if one of the "sparse" categories has high 
predictive power. You'll lose it when you pool, will you not?
Also, there is the problem of subjectively defining "sparse."

However, 1) seems quite sensible to me. But IANAE.
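
For what it is worth, option 1 could be sketched in base R roughly as follows. The data frame `d`, the 70/30 proportion, and all variable names are made up for illustration. A level with a single observation cannot land in both sets, so such levels would still need separate handling.

```r
set.seed(42)
# Hypothetical data: factor f with unequal level counts
d <- data.frame(
  f = factor(rep(c("r", "g", "b"), times = c(50, 30, 10))),
  y = rbinom(90, 1, 0.5)
)

# Sample about 70% of the row indices within each level of f
idx_by_level <- split(seq_len(nrow(d)), d$f)
train_idx <- unlist(lapply(idx_by_level, function(i) {
  i[sample.int(length(i), size = round(0.7 * length(i)))]
}), use.names = FALSE)

train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Every level present in test is also present in train,
# so predict() does not meet any new factor levels
fit <- glm(y ~ f, data = train, family = binomial)
p   <- predict(fit, newdata = test, type = "response")
```

Using `sample.int()` on the length of each index vector avoids the surprise that `sample()` treats a single number as `1:n`.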

-- Bert

On Sun, Nov 20, 2022 at 9:49 AM Mitchell Maltenfort <mmal...@gmail.com> wrote:
>
> Two possible fixes occur to me
>
> 1) Redo the test/training split, but within levels of the factor, so 
> you have the same split within each level and each level is accounted 
> for in training and testing
>
> 2) if you have a lot of levels, and perhaps sparse representation in a 
> few, consider recoding levels to pool the rare ones into an "other" 
> category
>
> On Sun, Nov 20, 2022 at 11:41 AM Bert Gunter <bgunter.4...@gmail.com> wrote:
>>
>> small reprex:
>>
>> set.seed(5)
>> dat <- data.frame(f = rep(c('r','g'),4), y = runif(8))
>> newdat <- data.frame(f = rep(c('r','g','b'),2))
>> ## convert values in newdat not seen in dat to NA
>> is.na(newdat$f) <- !(newdat$f %in% dat$f)
>> lmfit <- lm(y~f, data = dat)
>>
>> ##Result:
>> > predict(lmfit,newdat)
>>         1         2         3         4         5         6
>> 0.4374251 0.6196527        NA 0.4374251 0.6196527        NA
>>
>> If this does not suffice, as Rui said, we need details of what you did.
>> (predict.glm works like predict.lm)
>>
>>
>> -- Bert
>>
>>
>> On Sun, Nov 20, 2022 at 7:46 AM Rui Barradas <ruipbarra...@sapo.pt> wrote:
>> >
>> > At 15:29 on 20/11/2022, Gábor Malomsoki wrote:
>> > > Dear Bert,
>> > >
>> > > Yes, I was trying to fill the non-existent categories with NAs, but 
>> > > the solutions suggested on stackoverflow.com unfortunately did not work.
>> > >
>> > > Best regards
>> > > Gabor
>> > >
>> > >
>> > > Bert Gunter <bgunter.4...@gmail.com> wrote on Sun, 20 Nov 2022, 16:20:
>> > >
>> > >> You can't predict results for categories that you've not seen 
>> > >> before (think about it). You will need to remove those cases 
>> > >> from your test set (or convert them to NA and predict them as NA).
>> > >>
>> > >> -- Bert
>> > >>
>> > >> On Sun, Nov 20, 2022 at 7:02 AM Gábor Malomsoki <gmalomsoki1...@gmail.com> wrote:
>> > >>
>> > >>> Dear all,
>> > >>>
>> > >>> I have created a logistic regression model on the train df:
>> > >>> mymodel1 <- glm(book_state ~ TG_KraftF5, data = train, family = "binomial")
>> > >>>
>> > >>> Then I try to predict with the test df:
>> > >>> Predict <- predict(mymodel1, newdata = test, type = "response")
>> > >>> and I get this error message:
>> > >>> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
>> > >>> Factor "TG_KraftF5" has new levels
>> > >>>
>> > >>> I have tried different proposals from Stack Overflow, but 
>> > >>> unfortunately they did not solve the problem.
>> > >>> Do you have any idea how to test a logistic regression model 
>> > >>> when you have different levels in train and in test df?
>> > >>>
>> > >>> Thank you in advance
>> > >>> Regards,
>> > >>> Gabor
>> > >>>
>> > >>> ______________________________________________
>> > >>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> > >>> PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
>> > >>> and provide commented, minimal, self-contained, reproducible code.
>> > >>>
>> > >>
>> > >
>> >
>> > Hello,
>> >
>> > What exactly didn't work? You say you have tried the solutions 
>> > found on Stack Overflow, but without a link we don't know which 
>> > answers to which questions you are talking about.
>> > As Bert said, if you assign NA to the new levels that are present 
>> > only in test, it should work.
>> >
>> > Can you post links to what you have tried?
>> >
>> > Hope this helps,
>> >
>> > Rui Barradas


