Re: [R] index instead of loop?

Ben quant Thu, 08 Mar 2012 07:47:12 -0800

Humm.... If I understand what you are saying, you are correct. I get
144.138 for 2009-03-20 for column C. Maybe I posted the wrong code?  If so,
sorry.  Let me know if you disagree. I still plan to come back to this and
optimize it more, so if you see anything that would make it faster that
would be great. Of course, the for loop is my focus for optimization. Due
to some issues in the real data I had to add the lag and lag2 stuff in (I
don't think I had that before). In my real data the values don't really
belong in the z.dates the are aligned with, but to avoid lots of empty
values in the flat matrix (ua) they were forced in. I can push them into
their "real" dates via looking at a deeper lag. I'm thinking that all the
"which" stuff in the for look can be nested so that it runs faster. Also
the as.Date, abs() and max(which( etc. stuff seems like it could be handled
better/faster or outside the loop.


If you are interested in helping further, I can post a link to some 'real'
data.

Here is what I am using now and it seems to work.  Sorry, my code is still
very fluid:

z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 =
matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",

"20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

z.dates = rownames(ua)
############################## by rows
##########################################  FASTEST

start_t_all = Sys.time()
fix = function(x)
{
  year = substring(x, 1, 4)
  mo = substring(x, 5, 6)
  day = substring(x, 7, 8)
  ifelse(year=="--", "--", paste(year, mo, day, sep = "-"))

}
rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)

wd1 <- seq(from =as.Date(min(z.dates)), to = Sys.Date(), by = "day")
wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
wd = sapply(wd1, as.character)

mat = matrix(NA,nrow=length(wd),ncol=ncol(ua))
rownames(mat) = wd
nms = as.Date(rownames(ua))

for(i in 1:length(wd)){
  d = as.Date(wd[i])
  diff = abs(nms - d)
  rd_row_idx = max(which(diff == min(diff)))
  rd_col_idx = which(as.Date(rd[rd_row_idx,], format="%Y-%m-%d")  < d)
  rd_col_idx_lag = which(as.Date(rd[rd_row_idx - 1,], format="%Y-%m-%d")  <
d)
  rd_col_idx_lag2 = which(as.Date(rd[rd_row_idx - 2,], format="%Y-%m-%d")
< d)

  if(length(rd_col_idx_lag2) && (rd_row_idx - 2) > 0){

    mat[i,rd_col_idx_lag2] = ua[rd_row_idx - 2,rd_col_idx_lag2]
  }
  if(length(rd_col_idx_lag)){
    mat[i,rd_col_idx_lag] = ua[rd_row_idx - 1,rd_col_idx_lag]
  }
  if( length(rd_col_idx)){
    mat[i,rd_col_idx] = ua[rd_row_idx,rd_col_idx]
  }
}
colnames(mat)=colnames(ua)
print(Sys.time()-start_t_all)


Let me know if you disagree,

Ben

On Wed, Mar 7, 2012 at 5:57 PM, Rui Barradas <rui1...@sapo.pt> wrote:

> Hello again.
>
>
> Ben quant wrote
> >
> > Hello,
> >
> > In case anyone is interested in a faster solution for lots of columns.
> > This
> > solution is slower if you only have a few columns.  If anyone has
> anything
> > faster, I would be interested in seeing it.
> >
> > ### some mockup data
> > z.dates =
> >
> c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
> >
> > nms = c("A","B","C","D") # add more columns to see how the code below is
> > fsater
> > # these are the report dates that are the real days the data was
> > available,
> > so show the data the day after this date ('after' is a design decision)
> > rd1 = matrix(c("20070514","20070814","20071115",   "20080213",
> > "20080514",  "20080814",  "20081114",  "20090217",
> >                "20070410","20070709","20071009",   "20080109",
> > "20080407",  "20080708",  "20081007",  "20090112",
> >                "20070426","--","--","--","--","--","--","20090319",
> >                "--","--","--","--","--","--","--","--"),
> >              nrow=8,ncol=4)
> > dimnames(rd1) = list(z.dates,nms)
> >
> > # this is the unadjusted raw data, that always has the same dimensions,
> > rownames, and colnames as the report dates
> > ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
> >
> > 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
> >               NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
> >               NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
> >             nrow=8,ncol=4)
> > dimnames(ua) = list(z.dates,nms)
> >
> > ################################ the fastest code I have found:
> >
> > start_t_all = Sys.time()
> > fix = function(x)
> > {
> >   year = substring(x, 1, 4)
> >   mo = substring(x, 5, 6)
> >   day = substring(x, 7, 8)
> >   ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
> > }
> >
> > rd = apply(rd1, 2, fix)
> > dimnames(rd) = dimnames(rd)
> >
> > wd1 <- seq(from =as.Date(min(z.dates)), to = Sys.Date(), by = "day")
> > #wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
> > wd = sapply(wd1, as.character)
> >
> > mat = matrix(NA,nrow=length(wd),ncol=ncol(ua))
> > rownames(mat) = wd
> > nms = as.Date(rownames(ua))
> >
> > for(i in 1:length(wd)){
> >   d = as.Date(wd[i])
> >   diff = abs(nms - d)
> >   rd_row_idx = max(which(diff == min(diff)))
> >   rd_col_idx = which(rd[rd_row_idx,] < d)
> >
> >   if((rd_row_idx - 1) > 0){
> >     mat[i,] = ua[rd_row_idx - 1,]
> >   }
> >   if( length(rd_col_idx)){
> >     mat[i,rd_col_idx] = ua[rd_row_idx,rd_col_idx]
> >   }
> > }
> > colnames(mat)=colnames(ua)
> > print(Sys.time()-start_t_all)
> >
> > Regards,
> >
> > Ben
> >
> > On Tue, Mar 6, 2012 at 8:22 AM, Rui Barradas &lt;rui1174@&gt; wrote:
> >
> >> Hello,
> >>
> >> > Just looking at this, but it looks like ix doesn't exist:
> >> >        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
> >> > fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
> >> >
> >> >  Trying to sort it out now.
> >>
> >> Right, sorry.
> >> I've changed the name from 'ix' to 'inxlist' to make it more readable
> >> just
> >> before posting.
> >> And since the object 'ix' still existed in the R global environment it
> >> didn't throw an error...
> >>
> >> Your correction in the post that followed is what I meant.
> >>
> >> Correction (full loop, tested):
> >>
> >> for(tkr in 1:ncol(ua)){
> >>        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
> >>         ix <- lapply(1:nr, function(i)
> >>                         which(x[i] <= dt1 & dt1 < x[i + 1]))
> >>         sapply(1:length(ix), function(i)
> >>                 if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i,
> >> tkr])
> >> }
> >>
> >> Rui Barradas
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4450186.html
> >> Sent from the R help mailing list archive at Nabble.com.
> >>
> >> ______________________________________________
> >> R-help@ mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@ mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
> Maybe I'm not understanding the problem very well, but let me describe what
> I'm thinking it is.
>
> You have two tables, 'rd1' and 'ua' and a vector of dates, 'z.dates'.  The
> result is a table such that:
> 1. From 'z.dates' make a vector of daily dates.
> 2. Each column is filled with numbers from 'ua' based on dates in 'rd1',
> starting at the day given in step 1.
> My doubt is that your last posted code seems to give a special role to
> column 'A'.
>
> > mat[225:232, ]
>                               A       B              C      D
> 2007-11-10 636.16 2645     NaN NaN
> 2007-11-11 636.16 2645     NaN NaN
> 2007-11-12 636.16 2645     NaN NaN
> 2007-11-13 636.16 2645     NaN NaN
> 2007-11-14 636.16 2645     NaN NaN
> 2007-11-15 655.91 2645 -98.426 NaN
> 2007-11-16 655.91 2645 -98.426 NaN
> 2007-11-17 655.91 2645 -98.426 NaN
>
> The values in column 'C' change following the date in column 'A'. That is
> the third date in 'rd1',
> more exactly, rd1[3, 1] == "20071115".
>
> Shouldn't the values in mat[, "C"] start at 2009-03-20? The corresponding
> value in 'ua' would then be 144.138.
>
> (I still believe this can be made much faster.)
>
> Rui Barradas
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4455223.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] index instead of loop?

Reply via email to