Re: [R] remove extreme values or winsorize – loop - dataframe

Cecilia Carmo Mon, 02 Aug 2010 15:55:06 -0700

Thank you again, but I think I need to do some homework on the split function, because I don't understand it very well. Besides, I think I still have a problem. I also need X2 = X1 winsorized: X2 is equal to X1 between the 10% and 90% quantiles, equal to the 10% value when X1 is below it, and equal to the 90% value when X1 is above it.

Could you help me?

Thank you
Cecília
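
The winsorizing rule described above can be sketched on a single vector. This is only an illustration under stated assumptions: the helper name `winsorize` and the toy vector `x` are hypothetical and do not come from the thread, and `quantile()` is used with its default interpolation type.

```r
# Hypothetical helper (not from the thread): clamp x to its 10% and 90% quantiles.
winsorize <- function(x, probs = c(0.1, 0.9)) {
  # names = FALSE drops the "10%"/"90%" labels so the result is a plain vector
  lim <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # pmax() raises values below the lower limit, pmin() lowers values above the upper one
  pmin(pmax(x, lim[1]), lim[2])
}

# Toy data with one extreme value at each end
x <- c(-50, 1:9, 100)
winsorize(x)
```

Values between the two quantiles are returned unchanged; only the tails are replaced, so the result has the same length as the input.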


On Mon, 2 Aug 2010 18:42:27 -0400,
 jim holtman <jholt...@gmail.com> wrote:

This is just following up with the example data you sent. This will create a list 'result' that will have the subset of data between the
10% & 90%-tiles of the data:
#My reproducible example:
firm<-sort(rep(1:1000,10),decreasing=F)
year<-rep(1998:2007,1000)
industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
rep(10,10)),1000)
X1<-rnorm(10000)
data<-data.frame(firm, industry,year,X1)
# split the data by industry/year
d.s <- split(data, list(data$industry, data$year),drop=TRUE)
result <- lapply(d.s, function(.id){
    # get 10/90% values
    .limit <- quantile(.id$X1, probs=c(.1, .9))
    # keep only the rows whose X1 lies between the two limits
    subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
})
str(result)
List of 100
 $ 1.1998 :'data.frame':        800 obs. of  4 variables:
  ..$ firm    : int [1:800] 1 21 31 41 51 61 71 81 91 111 ...
  ..$ industry: num [1:800] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ year    : int [1:800] 1998 1998 1998 1998 1998 1998 1998 1998 1998 1998 ...
  ..$ X1      : num [1:800] 0.659 -0.105 -0.617 0.342 -1.077 ...
 $ 2.1998 :'data.frame':        800 obs. of  4 variables:
  ..$ firm    : int [1:800] 2 32 42 52 62 72 102 112 132 162 ...
  ..$ industry: num [1:800] 2 2 2 2 2 2 2 2 2 2 ...
  ..$ year    : int [1:800] 1998 1998 1998 1998 1998 1998 1998 1998 1998 1998 ...
  ..$ X1      : num [1:800] -1.1044 -0.0666 -0.9184 0.3469 -0.2348 ...
You can see that the 'name' of each list element is the industry.year
combination; this can also be seen in the data.
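
One possible way to get from such a list back to a single data frame with an extra X2 column is to add the column inside `lapply()` and then stack the pieces with `do.call(rbind, ...)`. The sketch below assumes a small made-up data frame `toy` (hypothetical, much smaller than the thread's example) and clamps X1 instead of dropping rows; the names `pieces` and `combined` are illustrative only.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is repeatable

# Hypothetical toy data: 2 industries x 2 years, 10 rows per combination
toy <- data.frame(
  firm     = rep(1:20, each = 2),
  industry = rep(1:2, each = 20),
  year     = rep(1998:1999, times = 20),
  X1       = rnorm(40)
)

# Split by industry/year, as in the thread
pieces <- split(toy, list(toy$industry, toy$year), drop = TRUE)

# Add X2 to every piece instead of subsetting it
pieces <- lapply(pieces, function(.id) {
  .limit <- quantile(.id$X1, probs = c(0.1, 0.9), na.rm = TRUE, names = FALSE)
  .id$X2 <- pmin(pmax(.id$X1, .limit[1]), .limit[2])
  .id
})

# Stack the pieces back into one data frame
combined <- do.call(rbind, pieces)
rownames(combined) <- NULL  # drop the "1.1998.3"-style row names
str(combined)
```

Note that the rows of `combined` come out grouped by industry/year, not in the original row order; sort by firm and year afterwards if the original order matters.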
On Mon, Aug 2, 2010 at 6:20 PM, Cecilia Carmo <cecilia.ca...@ua.pt> wrote:
Thank you for your help, but I don't understand how I can get a data frame with the columns firm, year, industry, X1 and X2. Could you help me
(again)?


Cecília Carmo


On Sat, 31 Jul 2010 22:10:38 -0400,
 jim holtman <jholt...@gmail.com> wrote:
This will split the data by industry & year and then return the values
in the middle 80% of each group (>= the 10% quantile & <= the 90% quantile)

# split the data by industry/year
d.s <- split(data, list(data$industry, data$year),drop=TRUE)
result <- lapply(d.s, function(.id){
  # get 10/90% values
  .limit <- quantile(.id$X1, probs=c(.1, .9))
  subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
})
This returns a list of 100 elements, one for each industry/year combination.
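
To see what `split()` with two grouping variables produces, a tiny example may help. The data frame `toy` below is hypothetical (five made-up rows, not the thread's data); the point is only how the list elements are formed and named.

```r
# Hypothetical five-row data frame, just to show the grouping
toy <- data.frame(
  industry = c(1, 1, 2, 2, 2),
  year     = c(1998, 1999, 1998, 1998, 1999),
  X1       = c(0.5, -1.2, 0.3, 2.1, -0.7)
)

# One list element per industry/year combination that occurs in the data;
# drop = TRUE omits combinations with no rows
groups <- split(toy, list(toy$industry, toy$year), drop = TRUE)

names(groups)          # element names are "<industry>.<year>"
sapply(groups, nrow)   # number of rows that fell into each group
groups[["2.1998"]]     # the two rows with industry 2 and year 1998
```

Each element is itself a data frame with the same columns as the input, which is why `lapply()` can then apply the same function to every group.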
On Sat, Jul 31, 2010 at 9:39 PM, Cecilia Carmo <cecilia.ca...@ua.pt> wrote:
Hi everyone!
#I need a loop or a function that creates an X2 variable that is X1 without
the extreme values (or X1 winsorized) by industry and year.
#My reproducible example:
firm<-sort(rep(1:1000,10),decreasing=F)
year<-rep(1998:2007,1000)

industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
rep(10,10)),1000)
X1<-rnorm(10000)
data<-data.frame(firm, industry,year,X1)
data
The way I’m doing this is very laborious. I split my sample by industry and year;
for each industry and year I calculate the 10% and 90% quantiles, then I
create an X2 variable like this:

industry1<-subset(data,data$industry==1)

ind1year1999<-subset(industry1,industry1$year==1999)
q1<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE)
q99<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE)

ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))

ind1year2000<-subset(industry1,industry1$year==2000)
q1<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE)
q99<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE)

ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
I repeat this for all years and industries, and then I merge/bind everything
again to get a new data frame with all the columns of the data frame «data» plus
X2.
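
The repeated subset/quantile/transform steps above could also be collapsed with base R's `ave()`, which applies a function within each group and returns the results in the original row order. This is a sketch under assumptions, not the method proposed in the thread: the data frame `toy` is a hypothetical small stand-in for «data», and the nested `ifelse()` is kept from the manual version.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is repeatable

# Hypothetical stand-in for «data»: 2 industries x 4 years, 5 rows per combination
toy <- data.frame(
  firm     = rep(1:10, each = 4),
  industry = rep(1:2, each = 20),
  year     = rep(1998:2001, times = 10),
  X1       = rnorm(40)
)

# ave() calls FUN once per industry/year group and puts each result
# back in the position of the rows it came from
toy$X2 <- ave(toy$X1, toy$industry, toy$year, FUN = function(x) {
  lim <- quantile(x, probs = c(0.1, 0.9), na.rm = TRUE, names = FALSE)
  ifelse(x < lim[1], lim[1], ifelse(x > lim[2], lim[2], x))
})

head(toy)
```

Because `ave()` preserves row order, no merge or bind step is needed afterwards; the data frame simply gains the X2 column.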

Could anyone help me do this in an easier way?

Thanks
Cecília Carmo
Universidade de Aveiro - Portugal

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
