Re: [R] Question re predict.glm & predict.lm in STATS

2022-02-16 Thread Bert Gunter
Ok, I looked at what you sent me privately and saw your error. I'll
reproduce and fix it just using a trivial example with lm(), for which
the predict() semantics are identical. Before I do, I note that your
claim:

"The predict.glm documentation says a warning will be given if the
length of newdata is not the same as the training set used to create
the model." is **completely wrong**. What predict.glm (and predict.lm)
actually says is:

"Variables are first looked for in newdata and then searched for in
the usual way (which will include the environment of the formula used
in the fit). A warning will be given if the variables found are not of
the same length as those in newdata if it was supplied."

This is *NOT AT ALL* what you claimed. The key point that you are
missing is the phrase 'searched for in the usual way.'  The details
are a bit technical but in many ways fundamental. They can be found in
any good tutorial or perhaps by searching on "scoping in R" or
"function environments in R". It's about how R finds the objects that
variable names point to. Section 10.7 ("Scope") of the "An Introduction
to R" manual that ships with R (and is therefore available to you) gives
a brief overview.
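
A minimal sketch of that lookup order, using made-up data (the names d,
x and fit are illustrative only and not from the original post): y is a
column of the data frame, while x is not, so x is found in the
environment of the formula.

```r
# x lives in the calling environment, not in the data frame
x <- c(10, 20, 30)
d <- data.frame(y = c(1, 2, 3))

# lm() looks in 'data' first, then in the formula's environment:
# y comes from d, x is found "in the usual way"
fit <- lm(y ~ x, data = d)

names(model.frame(fit))    # the variables the fit actually used
environment(formula(fit))  # where names not found in 'data' are looked up
```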

Anyway, here's the example that explains your error:

> train <- data.frame( y = runif(10), x = runif(10)) ## 10 rows
> test <- data.frame(x = runif(5))  ## 5 rows

## The following line is the source of your error
## You have specified your model incorrectly

> mdl <- lm(train$y ~train$x, data = train)

## The model is properly fitted because the variables in it, "train$y"
and "train$x" are found  "in the usual way" in the global environment,
the "enclosing environment" of the formula. (This is the technical
bit).  This leads to the sort of problem you saw with the predict
call:

> predict(mdl, newdat = test)
        1         2         3         4         5         6         7
0.6089476 0.6385268 0.9075589 0.3403276 0.2709199 0.5876634 0.8668307
        8         9        10
0.4689961 0.2571259 0.3281054
Warning message:
'newdata' had 5 rows but variables found have 10 rows

##Explanation: predict() is looking for a variable 'train$x', but test
only has a variable 'x', not 'train$x'. Since it doesn't find it, it
goes looking for 'train$x' "in the usual way" in the global
environment and finds it -- all 10 values as before. The prediction is
done using that data (the original fit) and the warning message is
emitted as per the documentation. (Note that 'newdat' works only
because R partially matches it to the 'newdata' argument.) Predicting
without the newdata argument does the same thing.
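
For anyone who wants to check this themselves, here is a small sketch
along the same lines. The seed and the object names msg and p are my own
additions for illustration. It captures the warning text and compares
the "predictions" with the fitted values of the training data.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
train <- data.frame(y = runif(10), x = runif(10))  # 10 rows
test  <- data.frame(x = runif(5))                   # 5 rows

# Mis-specified on purpose: the formula refers to train$x, not x
mdl <- lm(train$y ~ train$x, data = train)

# Capture the warning text instead of letting it print
msg <- tryCatch(
  {
    predict(mdl, newdata = test)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
msg

# The result has one value per training row, not per row of 'test',
# and it coincides with the fitted values of the original fit
p <- suppressWarnings(predict(mdl, newdata = test))
length(p)
all.equal(unname(p), unname(fitted(mdl)))
```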

The correct syntax for fitting the original model is:
> mdl <- lm(y ~ x, data = train)

## and then the predict() call works fine using the newdat argument
(as 'x' is found there)
> predict(mdl, newdat = test)
        1         2         3         4         5
0.5134899 0.4619013 0.2458162 0.0446871 0.3146897
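
One way to catch this kind of mistake before calling predict() is
sketched below. The helper missing_in_newdata() is hypothetical (it is
not part of R or of any package); it simply lists the names used on the
right-hand side of the model formula that are not columns of the new
data. It is a rough check only: constants such as pi, or objects that
are deliberately taken from the environment, would also be listed.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
train <- data.frame(y = runif(10), x = runif(10))
test  <- data.frame(x = runif(5))

# Hypothetical helper: names the predictors refer to that are
# not columns of 'newdata'
missing_in_newdata <- function(model, newdata) {
  rhs <- delete.response(terms(model))
  setdiff(all.vars(rhs), names(newdata))
}

bad  <- lm(train$y ~ train$x, data = train)  # mis-specified
good <- lm(y ~ x, data = train)              # correct

missing_in_newdata(bad, test)   # non-empty: 'train' is not a column of test
missing_in_newdata(good, test)  # empty: every predictor is in test

# With the correct model there is one prediction per row of 'test'
length(predict(good, newdata = test)) == nrow(test)
```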

All of this is documented, with examples, in ?glm and ?lm and in any
tutorial on their use. Please spend the time to study these carefully.
Trying to mimic examples you find, which seems to be what you are
doing, is rarely sufficient.

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Feb 16, 2022 at 7:24 AM Bert Gunter wrote:
>
> You should (almost) always reply to the list to maximize your opportunity for 
> useful help. Also, I don't do private consulting.
>
> See ?dput and ?str for ways to put code and data as plain text into a post 
> via copying and pasting from the R Console. You can also just type the code 
> directly, of course. The RHelp server will strip most attachments (I think 
> .png is OK for graphs, though; you can ask on list if necessary). I don't 
> recall whether Word makes it through, but you really shouldn't need such 
> attachments anyway.
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and 
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Wed, Feb 16, 2022 at 3:39 AM STEPHEN KAISLER wrote:
>>
>> Bert:
>>
>> Please see the attached file which shows the approach I used.
>> Thanks for any assistance that you can offer.
>>
>> Steve Kaisler
>>
>> On 02/15/2022 4:05 PM Bert Gunter wrote:
>>
>>
>> ??
>> Show us the error. Show us the call.
>>
>>
>> On Tue, Feb 15, 2022, 12:14 PM STEPHEN KAISLER wrote:
>>
>> Folks:
>>
>> I have used glm/lm to build a model on a training set derived from auto_mpg 
>> data of 274 records (70% sampling).
>>
>> The test data set has 118 records.
>>
>> I am trying to use predict.glm or predict.lm to predict the values of mpg 
>> from disp, hp,weight, accel, and cyl.
>>
>> However I get the following message:
>>
>>
>> So, the resulting vector has 274 rows, when I believe it sh

Re: [R] confusion matrix like detail with continuous data?

2022-02-16 Thread Ebert,Timothy Aaron
Your prediction will have a target level of accuracy, something like "I 
need to predict the slope of the regression to within 1%." You split your data 
into training and testing data sets, then for the testing data set you ask 
whether the prediction is within 1% of the observed value. That is about as 
close as I can come, as I have trouble seeing how to get a false positive out 
of a regression with a continuous dependent variable.
   Of course, you need enough data that splitting the data set into two 
pieces leaves enough observations to fit a reasonable model.
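
A rough sketch of that check on simulated data follows. The data, the
70/30 split and the 1% tolerance are all assumptions for illustration,
and the tolerance is applied to each predicted value rather than to the
slope.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- 2 + 3 * dat$x + rnorm(n, sd = 0.1)  # simulated response

# 70/30 split into training and testing sets
idx   <- sample(n, size = 140)
train <- dat[idx, ]
test  <- dat[-idx, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Is each prediction within 1% of the observed value?
tol        <- 0.01
within_tol <- abs(pred - test$y) <= tol * abs(test$y)

table(within_tol)  # counts of hits and misses
mean(within_tol)   # proportion of test predictions within tolerance
```
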
Tim

-----Original Message-----
From: R-help On Behalf Of Ivan Krylov
Sent: Wednesday, February 16, 2022 5:00 AM
To: r-help@r-project.org
Subject: Re: [R] confusion matrix like detail with continuous data?

On Tue, 15 Feb 2022 22:17:42 +0100
Neha gupta wrote:

> (1) Can we get the details like the confusion matrix with continuous 
> data?

I think the closest you can get is a predicted-reference plot. That is, plot 
true values on the X axis and the corresponding predicted values on the Y axis.

Unsatisfying option: use cut() to transform a continuous variable into a 
categorical variable and make a confusion matrix out of that.
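
A small sketch of the cut() approach, with simulated values and
arbitrary bin edges (the breaks and labels are assumptions chosen only
for illustration; real ones would have to come from the problem at
hand):

```r
set.seed(1)  # arbitrary seed, only for reproducibility
truth <- runif(100, min = 0, max = 10)  # simulated reference values
pred  <- truth + rnorm(100, sd = 1)     # simulated predictions

# Bin both vectors with the same breaks so the categories line up
breaks <- c(-Inf, 2.5, 5, 7.5, Inf)
labs   <- c("low", "mid-low", "mid-high", "high")
truth_bin <- cut(truth, breaks = breaks, labels = labs)
pred_bin  <- cut(pred,  breaks = breaks, labels = labs)

# Confusion-matrix-like table of binned predictions vs. reference
cm <- table(predicted = pred_bin, reference = truth_bin)
cm
sum(diag(cm)) / sum(cm)  # share of predictions falling in the right bin
```

The result depends heavily on where the breaks are placed, which is one
reason this option is unsatisfying.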

> (2) How can we get the mean absolute error for an individual instance? 
> For example, if the ground truth is 4 and our model predicted as 6, 
> how to find the mean absolute error for this instance?

Mathematically speaking, mean absolute error of an individual instance would be 
just the absolute value of the error in that instance, but that's probably not 
what you're looking for. If you need some kind of confidence bands for the 
predictions, it's the model's responsibility to provide them. There are many 
options, ranging from the use of the loss function derivative around the 
optimum to Monte-Carlo simulations.
For examples, see the confint() method.
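
A short sketch of both points. The truth and pred vectors are made up
(the first pair is the 4 vs. 6 case from the question), and the built-in
cars data set with a plain lm() fit is only an assumed stand-in for the
poster's model:

```r
# Per-instance absolute error and its mean over all instances
truth   <- c(4, 3, 7)
pred    <- c(6, 3.5, 6)
abs_err <- abs(truth - pred)
abs_err[1]     # 2: the error for truth 4, prediction 6
mean(abs_err)  # mean absolute error over the three instances

# Intervals from a linear model fitted to the built-in 'cars' data
fit <- lm(dist ~ speed, data = cars)
confint(fit)  # confidence intervals for the coefficients

# Prediction intervals for new observations
new <- data.frame(speed = c(10, 20))
pi  <- predict(fit, newdata = new, interval = "prediction")
pi
```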

--
Best regards,
Ivan

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.