What's good about Julia is that if a library or some piece of code is slow, it 
is usually easy to tweak it to get your performance back.

For example, if we focus only on the beta regressions, here is your current 
code (slightly modified so that reading the CSV is not part of the benchmark): 
https://gist.github.com/KristofferC/f424e1d7daf4c6496ad2

On my computer it takes this amount of time:

elapsed time: 4.925883456 seconds (1697125616 bytes allocated, 31.27% gc time)
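
By "reading the CSV is not part of the benchmark" I just mean a pattern like 
this (a minimal sketch, the gist has the actual code): read the file first and 
only time the computation, so readtable is not included in the measured time.

using DataFrames

ds = readtable("xri.csv")   # read the data outside the timed region

tic()
# ... the regression loop and the rest of the computation ...
toc()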

However, we are not using all the information we have. For example, from 
your file we can see that the entries are sorted. We can therefore quickly 
find the candidate rows that have the correct "fm" and then, from these 
candidate rows, find the ones that are within the window. Something like 
this: https://gist.github.com/KristofferC/207dcc6f4f22eed73562 (a rough 
sketch of the idea follows after the timing below). I haven't tested this 
carefully, so there might be some off-by-one errors, but it should be doing 
the same work.

This one takes:
elapsed time: 0.220236023 seconds (65040756 bytes allocated, 22.00% gc time)
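
In case you don't want to open the gist, here is a rough sketch of the idea 
(not the exact code from the gist; the window_rows helper and its details are 
just mine for illustration). Since the file is sorted by fm and, within each 
fm, by day, both lookups can be binary searches instead of scanning the whole 
DataFrame for every regression:

# Pull the columns out of the DataFrame once.
fm_col, day_col = ds[:fm], ds[:day]

# Rows with fm == thisfm form one contiguous block (fm is sorted), and inside
# that block day is sorted too, so we binary search twice.
function window_rows(fm, day, thisfm, lo_day, hi_day)
    fmrange = searchsorted(fm, thisfm)       # block of rows with this fm
    sub_day = day[fmrange]                   # their (sorted) days
    lo = searchsortedfirst(sub_day, lo_day)
    hi = searchsortedlast(sub_day, hi_day)
    return (first(fmrange) - 1 + lo):(first(fmrange) - 1 + hi)
end

# Inside the loop, instead of the @where call:
# rows = window_rows(fm_col, day_col, thisFm, thisDay - NWINDOW, thisDay - 1)
# dataSubset = ds[rows, :]

Each subset is then found with a couple of binary searches instead of a full 
pass over all the rows.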

I don't know if you can tell DataFramesMeta that columns are sorted; if so, 
it could probably be sped up further.


On Sunday, August 23, 2015 at 1:01:28 AM UTC+2, Michael Wang wrote:
>
> I am new to Julia. I heard that Julia has performance comparable to C++ even 
> though it is a high-level language. I tested an example on my machine; 
> however, the result was that Julia was in the same ballpark as R, not C++. 
> Here is my code.
> R:
> ptm <- proc.time()
>
> DPY <- 252  ## days per year
> NWINDOW <- 126  ## can be smaller or larger than 252
>
> ds <- read.csv("xri.csv")  ## a sample data set
>
> ## PS: this is much faster than assigning to a data frame in a loop
> b.ols <- sd.ols <- rep(NA, nrow(ds))
>
> for (i in 1:nrow(ds)) {
>     thisday <- ds$day[i]
>     if (thisday %% DPY != 0) next  ## calculate only on year end
>     if (thisday < DPY) next  ## start only without NA
>     thisfm <- ds$fm[i]
>     datasubset <- subset(ds, (ds$fm==thisfm) & (ds$day>=(thisday-NWINDOW)) &
>                               (ds$day<=(thisday-1)))
>     olsreg <- lm(xr ~ xm, data = datasubset)
>     b.ols[i] <- coef(olsreg)[2]
>     sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
>     cat(i, " ")  ## ping me to see we are not dead for large data sets
> }
>
> ds$b.ols <- b.ols
> ds$sd.ols <- sd.ols
>
> cat("\nOLS Beta Regressions are Done\n")
>
> ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
> ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
>
> cat("Cross-Sectional OLS Statistics are Done\n")
>
> ds <- within(ds, {
>                  w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
>                  b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean,4)
>                  b.ols <- round(b.ols,4)
>              })
>
> cat("OLS and VCK are Done.  Now Writing Output.\n")
>
> proc.time() - ptm
>
>
> The running time is around 30 seconds for R.
>
> Julia:
> using DataFrames
> using DataFramesMeta
> using GLM
>
> tic()
> DPY = 252  ## days per year
> NWINDOW = 126  ## can be smaller or larger than 252
>
> ds = readtable("xri.csv")  ## a sample data set
>
> # create two empty arrays to store b_ols and sd_ols value
> b_ols = DataArray(Float64, size(ds)[1])
> sd_ols = DataArray(Float64, size(ds)[1])
>
> for i = 1:size(ds)[1]
>     thisDay = ds[i, :day]  ## Julia DataFrame way of accessing data, in R: ds$day[i]
>     if mod(thisDay, DPY) != 0
>         continue
>     end
>     if thisDay < DPY
>         continue
>     end
>     thisFm = ds[i, :fm]
>     dataSubset = @where(ds, (:fm .== thisFm) & (:day .>= (thisDay - NWINDOW)) &
>                             (:day .<= (thisDay - 1)))
>     olsReg = fit(LinearModel, xr ~ xm, dataSubset)  ## OLS from package GLM
>     b_ols[i] = coef(olsReg)[2]  ## returns the OLS coefficients
>     sd_ols[i] = stderr(olsReg)[2]  ## returns the OLS coefficients' standard error
>     print(i, " ")
> end
>
> ds[:b_ols] = b_ols
> ds[:sd_ols] = sd_ols
>
> print("\nOLS Beta Regressions are Done\n")
>
> ds = join(ds, by(ds, :day) do ds
>     DataFrame(xsect_mean = mean(dropna(ds[:b_ols])),
>               xsect_sd = std(dropna(ds[:b_ols])))
> end, on = [:day], kind = :inner)
> ds = sort!(ds)
>
> print("Cross-Sectional OLS Statistics are Done\n")
>
> ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
> ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
> ds[:b_ols] = @with(ds, round(:b_ols, 4))
>
> print("OLS and VCK are Done.  Now Writing Output.\n")
>
> toc()
>
> The running time is around 15 seconds for Julia.
>
> I tried C++, too. Producing the same output as R and Julia, the C++ version 
> took only 0.23 seconds. Can someone tell me why this is happening?
