Re: [R] Problem while predicting in regression trees

2016-05-10 Thread Muhammad Bilal
Many thanks, Max, for these valuable suggestions.


--
Muhammad Bilal
Research Fellow and Doctoral Researcher,
Bristol Enterprise, Research, and Innovation Centre (BERIC),
University of the West of England (UWE),
Frenchay Campus,
Bristol,
BS16 1QY

muhammad2.bi...@live.uwe.ac.uk




Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Max Kuhn
I've brought this up numerous times... you shouldn't use `predict.rpart`
(or whatever modeling function) from the `finalModel` object. That object
has no idea what was done to the data prior to its invocation.

The issue here is that `train(formula)` converts the factors to dummy
variables. `rpart` does not require that and the `finalModel` object has no
idea that that happened. Using `predict.train` works just fine so why not
use it?

> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923         -1777.583           -1217.3
                3                 3                 6                 3
        -886.6667          -408.375            -375.7 -240.307692307692
                5                 1                 4                 5
-201.612903225806 -19.6071428571429          30.80833              43.9
   307266 9
            151.5  209.647058823529
628
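
The mechanism described above can be reproduced without caret. The sketch below only imitates what a formula interface that creates dummy variables does; it is not caret's actual code. It expands a factor with `model.matrix()`, fits `rpart` on the expanded columns, and shows that predicting from the raw data frame fails while predicting from identically expanded data works. The toy data and the names `toy`, `expand`, `train_x` and `fit` are assumptions made for illustration.

```r
library(rpart)

set.seed(1)
# Toy data: one factor predictor and a numeric outcome (made up for illustration)
toy <- data.frame(
  sector = factor(rep(c("Defense", "Hospitals", "Schools"), each = 20)),
  delay  = c(rnorm(20, -300, 10), rnorm(20, 0, 10), rnorm(20, 200, 10))
)

# Expand the factor into dummy columns and drop the intercept column
expand <- function(d) {
  as.data.frame(model.matrix(~ sector, data = d)[, -1, drop = FALSE])
}
train_x <- expand(toy)
names(train_x)                      # "sectorHospitals" "sectorSchools"

# The fitted tree only knows about the dummy columns
fit <- rpart(delay ~ ., data = cbind(train_x, delay = toy$delay))

# Predicting from the raw data frame fails: the dummy columns are missing
raw_attempt <- try(predict(fit, newdata = toy), silent = TRUE)
inherits(raw_attempt, "try-error")  # TRUE

# Applying the same expansion to the new data makes prediction work
pred <- predict(fit, newdata = expand(toy))
length(pred)                        # 60
```

Calling `predict` on the `train` object, as shown above with `tr_m`, hands this bookkeeping back to caret, so the new data is treated the same way as the training data.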


Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Muhammad Bilal
Hi Bill,


Many thanks for highlighting the issue. It worked once I predicted using 
`tr_m` instead of `tr_m$finalModel`. I'm extremely grateful for the insight.


Thanks to all who gave me prior guidance as well.



Re: [R] Problem while predicting in regression trees

2016-05-09 Thread William Dunlap via R-help
Why are you predicting from tr_m$finalModel instead of from tr_m?

Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Muhammad Bilal
The dataset can also be downloaded from the following link:

https://www.dropbox.com/s/kkiwm32jxfk7jac/pfi_data.csv?dl=0


Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Muhammad Bilal
Please find the sample dataset attached along with R code pasted below to 
reproduce the issue.


#Loading the packages used below
library(caTools)     #sample.split
library(caret)       #train, trainControl
library(rpart.plot)  #prp

#Loading the data frame
pfi <- read.csv("pfi_data.csv")

#Splitting the data into training and test sets
#(sample.split expects the outcome vector, not the whole data frame)
split <- sample.split(pfi$project_delay, SplitRatio = 0.7)
trainPFI <- subset(pfi, split == TRUE)
testPFI <- subset(pfi, split == FALSE)

#Cross validating the decision trees
tr.control <- trainControl(method="repeatedcv", number=20)
cp.grid <- expand.grid(.cp = (0:10)*0.001)
tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + 
sector + contract_type + capital_value, data = trainPFI, method="rpart", 
trControl=tr.control, tuneGrid = cp.grid)

#Displaying the train results
tr_m

#Fetching the best tree
best_tree <- tr_m$finalModel

#Plotting the best tree
prp(best_tree)

#Using the best tree to make predictions [This command raises the error]
best_tree_pred <- predict(best_tree, newdata = testPFI)

#Calculating the SSE
best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)

#Displaying the SSE
best_tree_pred.sse

...


Many Thanks and


Kind Regards




Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Max Kuhn
It is extremely difficult to tell what the issue might be without a
reproducible example.

The only thing that I can suggest is to use the non-formula interface to
`train` so that you can avoid creating dummy variables.
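
A minimal sketch of the non-formula interface follows, assuming the caret and rpart packages are installed. The data are simulated stand-ins for the PFI data, and the names `sim` and `fit` as well as the small `cp` grid are illustrative choices, not taken from the thread. Predictors are passed as a data frame through `x` and the outcome as a vector through `y`, so the factor column reaches `rpart` unchanged.

```r
library(caret)   # assumed to be installed, together with rpart

set.seed(100)
# Simulated stand-in for the PFI data (values are made up)
n <- 120
sim <- data.frame(
  sector   = factor(sample(c("Defense", "Hospitals", "Schools"), n, replace = TRUE)),
  duration = runif(n, 100, 3000)
)
sim$delay <- ifelse(sim$sector == "Hospitals", -200, 50) + rnorm(n, sd = 20)

# Non-formula interface: predictors as a data frame, outcome as a vector.
# No dummy variables are created for the factor column.
fit <- train(
  x = sim[, c("sector", "duration")],
  y = sim$delay,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = data.frame(cp = c(0.001, 0.01))
)

# The underlying tree refers to the original column names
attr(fit$finalModel$terms, "term.labels")

# Predict through the train object rather than through fit$finalModel
pred <- predict(fit, newdata = sim[, c("sector", "duration")])
```

Even with this interface, predicting through the `train` object keeps any preprocessing consistent between training and new data.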


Re: [R] Problem while predicting in regression trees

2016-05-09 Thread Muhammad Bilal
Hi Bert,

Thanks for the response.

I checked the datasets; however, the Hospitals level appears in both of them. 
See the output below:

> sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
            sector count(*)
1          Defense        9
2        Hospitals      101
3          Housing       32
4           Others       99
5 Public Buildings       39
6          Schools      148
7      Social Care       10
8   Transportation       27
9            Waste       26
> sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
            sector count(*)
1          Defense        5
2        Hospitals       47
3          Housing       11
4           Others       44
5 Public Buildings       18
6          Schools       69
7      Social Care        9
8   Transportation        8
9            Waste       12

Anything else to try?




From: Bert Gunter <bgunter.4...@gmail.com>
Sent: 09 May 2016 01:42:39
To: Muhammad Bilal
Cc: r-help@r-project.org
Subject: Re: [R] Problem while predicting in regression trees

It seems that the data that you used for prediction contained a level
"Hospitals" for the sector factor that did not appear in the training
data (or maybe it's the other way round). Check this.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
<muhammad2.bi...@live.uwe.ac.uk> wrote:
> Hi All,
>
> I have the following script, which raises an error at the last command. I am 
> new to R and would like some clarification on what is going wrong.
>
> #Creating the training and testing data sets
> splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> testPFI <- subset(pfi_v3, splitFlag==FALSE)
>
>
> #Structure of the trainPFI data frame
>> str(trainPFI)
> ***
> 'data.frame': 491 obs. of  16 variables:
>  $ project_id        : int  1 2 3 6 7 9 10 12 13 14 ...
>  $ project_lat       : num  51.4 51.5 52.2 51.9 52.5 ...
>  $ project_lon       : num  -0.642 -1.85 0.08 -0.401 -1.888 ...
>  $ sector            : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
>  $ contract_type     : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
>  $ project_duration  : int  1826 3652 121 730 730 790 522 819 998 372 ...
>  $ project_delay     : int  -323 0 -60 0 0 0 -91 0 0 7 ...
>  $ capital_value     : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
>  $ project_delay_pct : num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
>  $ delay_type        : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
>
> library(caret)
> library(e1071)
>
> set.seed(100)
>
> tr.control <- trainControl(method="cv", number=10)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
>
> #Fitting the model using regression tree
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + 
> sector + contract_type + capital_value, data = trainPFI, method="rpart", 
> trControl=tr.control, tuneGrid = cp.grid)
>
> tr_m
>
> CART
> 491 samples
> 15 predictor
> No pre-processing
> Resampling: Cross-Validated (10 fold)
> Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> Resampling results across tuning parameters:
>   cp RMSE  Rsquared
>   0.000  441.1524  0.5417064
>   0.001  439.6319  0.5451104
>   0.002  437.4039  0.5487203
>   0.003  432.3675  0.551
>   0.004  434.2138  0.5519964
>   0.005  431.6635  0.551
>   0.006  436.6163  0.5474135
>   0.007  440.5473  0.5407240
>   0.008  441.0876  0.5399614
>   0.009  441.5715  0.5401718
>   0.010  441.1401  0.5407121
> RMSE was used to select the optimal model using  the smallest value.
> The final value used for the model was cp = 0.005.
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> All of the commands above worked fine.
>
> However, the next command raises an error when the fitted model is used to
> make predictions:
> best_tree_pred <- predict(best_tree, newdata = testPFI)
> Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
>
> Can someone guide me on how to resolve this issue?

Re: [R] Problem while predicting in regression trees

2016-05-08 Thread Bert Gunter
It seems that the data that you used for prediction contained a level
"Hospitals" for the sector factor that did not appear in the training
data (or maybe it's the other way round). Check this.
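
One way to run this check is sketched below. The data frames `train_df` and `test_df` are hypothetical stand-ins for trainPFI and testPFI, and `values_only_in` is a helper name made up for this example. Comparing the observed values (rather than `levels()`) is a deliberate choice here, since a factor can declare levels that never occur in the subset, and `contract_type` in this thread is a character column with no levels at all.

```r
# Hypothetical stand-ins for trainPFI and testPFI.
sector_levels <- c("Hospitals", "Others", "Schools")
train_df <- data.frame(
  sector        = factor(c("Schools", "Others", "Schools"), levels = sector_levels),
  contract_type = c("Turnkey", "Turnkey", "Other"),
  stringsAsFactors = FALSE
)
test_df <- data.frame(
  sector        = factor(c("Hospitals", "Schools"), levels = sector_levels),
  contract_type = c("Turnkey", "Lease"),
  stringsAsFactors = FALSE
)

# Values that occur in 'a' but never in 'b'.
# as.character() makes this work for factors and character columns alike,
# and ignores factor levels that are declared but unused.
values_only_in <- function(a, b) {
  setdiff(unique(as.character(a)), unique(as.character(b)))
}

for (col in c("sector", "contract_type")) {
  cat(col, "\n")
  cat("  in test but not in train:",
      values_only_in(test_df[[col]], train_df[[col]]), "\n")
  cat("  in train but not in test:",
      values_only_in(train_df[[col]], test_df[[col]]), "\n")
}
```

An empty result in both directions for every categorical predictor would suggest that the error has some other cause.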

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal wrote:
> Hi All,
>
> I have the following script, which raises an error at the last command. I am new
> to R and would appreciate some clarification on what is going wrong.
>
> #Creating the training and testing data sets
> splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> testPFI <- subset(pfi_v3, splitFlag==FALSE)
>
>
> #Structure of the trainPFI data frame
>> str(trainPFI)
> ***
> 'data.frame': 491 obs. of  16 variables:
>  $ project_id       : int  1 2 3 6 7 9 10 12 13 14 ...
>  $ project_lat      : num  51.4 51.5 52.2 51.9 52.5 ...
>  $ project_lon      : num  -0.642 -1.85 0.08 -0.401 -1.888 ...
>  $ sector           : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
>  $ contract_type    : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
>  $ project_duration : int  1826 3652 121 730 730 790 522 819 998 372 ...
>  $ project_delay    : int  -323 0 -60 0 0 0 -91 0 0 7 ...
>  $ capital_value    : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
>  $ project_delay_pct: num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
>  $ delay_type       : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
>
> library(caret)
> library(e1071)
>
> set.seed(100)
>
> tr.control <- trainControl(method="cv", number=10)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
>
> #Fitting the model using regression tree
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + 
> sector + contract_type + capital_value, data = trainPFI, method="rpart", 
> trControl=tr.control, tuneGrid = cp.grid)
>
> tr_m
>
> CART
> 491 samples
> 15 predictor
> No pre-processing
> Resampling: Cross-Validated (10 fold)
> Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> Resampling results across tuning parameters:
>   cp RMSE  Rsquared
>   0.000  441.1524  0.5417064
>   0.001  439.6319  0.5451104
>   0.002  437.4039  0.5487203
>   0.003  432.3675  0.551
>   0.004  434.2138  0.5519964
>   0.005  431.6635  0.551
>   0.006  436.6163  0.5474135
>   0.007  440.5473  0.5407240
>   0.008  441.0876  0.5399614
>   0.009  441.5715  0.5401718
>   0.010  441.1401  0.5407121
> RMSE was used to select the optimal model using  the smallest value.
> The final value used for the model was cp = 0.005.
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> All of the commands above worked fine.
>
> However, the next command raises an error when the fitted model is used to
> make predictions:
> best_tree_pred <- predict(best_tree, newdata = testPFI)
> Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
>
> Can someone guide me on how to resolve this issue?
>
> Any help will be highly appreciated.
>
> Many Thanks and
>
> Kind Regards
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bi...@live.uwe.ac.uk
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



[R] Problem while predicting in regression trees

2016-05-08 Thread Muhammad Bilal
Hi All,

I have the following script, which raises an error at the last command. I am new to
R and would appreciate some clarification on what is going wrong.

#Creating the training and testing data sets
splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
trainPFI <- subset(pfi_v3, splitFlag==TRUE)
testPFI <- subset(pfi_v3, splitFlag==FALSE)
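
A side note on the split itself: the script does not show which package provides `sample.split`, but if it is the one from caTools, its documentation describes the first argument as a vector of labels rather than a whole data frame. Passing a data frame may then yield one flag per column instead of one per row, which `subset` would silently recycle across the rows. The sketch below shows a per-row split in base R; the toy data frame `pfi_toy` and the other names are hypothetical, and this is one possible approach rather than the poster's method.

```r
set.seed(100)

# Hypothetical stand-in for pfi_v3: 20 rows, 3 columns.
pfi_toy <- data.frame(
  project_id    = 1:20,
  project_delay = seq(-95, 95, by = 10),
  sector        = factor(rep(c("Schools", "Hospitals"), times = 10))
)

# One logical flag per ROW: TRUE for a random 70% of the rows.
n_rows     <- nrow(pfi_toy)
train_rows <- sample(n_rows, size = round(0.7 * n_rows))
split_flag <- seq_len(n_rows) %in% train_rows

train_toy <- pfi_toy[split_flag, ]
test_toy  <- pfi_toy[!split_flag, ]

# Sanity checks: the flag vector has one entry per row,
# and the two parts together cover every row exactly once.
stopifnot(length(split_flag) == n_rows)
stopifnot(nrow(train_toy) + nrow(test_toy) == n_rows)
```

With caTools, the analogous call would pass a single column, for example `sample.split(pfi_v3$project_delay, SplitRatio = 0.7)`; that suggestion should be checked against the package documentation.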


#Structure of the trainPFI data frame
> str(trainPFI)
***
'data.frame': 491 obs. of  16 variables:
 $ project_id       : int  1 2 3 6 7 9 10 12 13 14 ...
 $ project_lat      : num  51.4 51.5 52.2 51.9 52.5 ...
 $ project_lon      : num  -0.642 -1.85 0.08 -0.401 -1.888 ...
 $ sector           : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
 $ contract_type    : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
 $ project_duration : int  1826 3652 121 730 730 790 522 819 998 372 ...
 $ project_delay    : int  -323 0 -60 0 0 0 -91 0 0 7 ...
 $ capital_value    : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
 $ project_delay_pct: num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
 $ delay_type       : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...

library(caret)
library(e1071)

set.seed(100)

tr.control <- trainControl(method="cv", number=10)
cp.grid <- expand.grid(.cp = (0:10)*0.001)

#Fitting the model using regression tree
tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + 
sector + contract_type + capital_value, data = trainPFI, method="rpart", 
trControl=tr.control, tuneGrid = cp.grid)

tr_m

CART
491 samples
15 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
Resampling results across tuning parameters:
  cp RMSE  Rsquared
  0.000  441.1524  0.5417064
  0.001  439.6319  0.5451104
  0.002  437.4039  0.5487203
  0.003  432.3675  0.551
  0.004  434.2138  0.5519964
  0.005  431.6635  0.551
  0.006  436.6163  0.5474135
  0.007  440.5473  0.5407240
  0.008  441.0876  0.5399614
  0.009  441.5715  0.5401718
  0.010  441.1401  0.5407121
RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was cp = 0.005.
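
The selection rule stated in the output above ("the smallest value" of RMSE) can be checked by hand. The sketch below re-enters the cp and RMSE columns from the printed table and picks the row with the lowest RMSE; the variable names are made up for this example.

```r
# cp grid and cross-validated RMSE values copied from the table above.
cp   <- (0:10) * 0.001
rmse <- c(441.1524, 439.6319, 437.4039, 432.3675, 434.2138, 431.6635,
          436.6163, 440.5473, 441.0876, 441.5715, 441.1401)

results <- data.frame(cp = cp, RMSE = rmse)

# Row with the smallest RMSE, which is the rule the output describes.
best <- results[which.min(results$RMSE), ]
print(best)
```

On a fitted caret `train` object, the same information should be available as `tr_m$results` and `tr_m$bestTune`, without retyping the table.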

#Fetching the best tree
best_tree <- tr_m$finalModel

All of the commands above worked fine.

However, the next command raises an error when the fitted model is used to
make predictions:
best_tree_pred <- predict(best_tree, newdata = testPFI)
Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found

Can someone guide me on how to resolve this issue?

Any help will be highly appreciated.

Many Thanks and

Kind Regards

--
Muhammad Bilal
Research Fellow and Doctoral Researcher,
Bristol Enterprise, Research, and Innovation Centre (BERIC),
University of the West of England (UWE),
Frenchay Campus,
Bristol,
BS16 1QY

muhammad2.bi...@live.uwe.ac.uk

