I'm traveling and cannot take the time to go through this code more thoroughly. However, I see at least a few issues worth mentioning in addition to the other comments:
1) Note that Kristoffer's function 'f' in https://gist.github.com/KristofferC/207dcc6f4f22eed73562 does not include the readtable operation, so that cost is not included in the "fast" time. Work is being done on a performant CSV reader, but I believe the present implementations in both DataFrames and Base are either on a par with or slower than R's reader.

2) Unfortunately, naive indexing into DataFrames is not type-stable. That is to say, Julia's JIT compiler cannot infer the type of a value returned from indexing into a DataFrame. It can infer that type, however, if you pass the relevant columns of said DataFrame into a function and perform the relevant operations on the columns within the function. Thus you may see some further performance improvements if you wrap the loop in Kristoffer's "fast" 'f' in another function that takes the relevant columns of the DataFrame as arguments (see the first sketch at the end of this message).

3) I haven't tested this hypothesis, but I doubt that the macros per se are incurring the performance penalty. In fact, the @with macro employs precisely the strategy noted above and for this reason is generally faster than the naive alternative. I suspect what is really contributing to the performance costs in those lines is the extensive use of broadcasted operators, since each of those allocates a new array. If you can find a way to do those operations in place, you may save some time there (second sketch at the end).

4) Naive use of anonymous functions currently can incur a performance penalty, which may be affecting the 'join' operation in your original code (third sketch at the end).

On Wednesday, August 26, 2015 at 4:43:13 PM UTC-6, Michael Wang wrote:
>
> Thank you Kristoffer. This helps a lot.
>
> On Sunday, August 23, 2015 at 12:46:04 PM UTC-7, Kristoffer Carlsson wrote:
>>
>> What's good with Julia is that if a library or something is slow, it is
>> usually easy to tweak it to get your performance.
>>
>> For example, if we only focus on the beta regressions, here is your
>> current code (modified a little so that reading the csv is excluded from
>> the benchmark): https://gist.github.com/KristofferC/f424e1d7daf4c6496ad2
>>
>> On my computer it takes this amount of time:
>>
>> elapsed time: 4.925883456 seconds (1697125616 bytes allocated, 31.27% gc time)
>>
>> However, we are not using all the information we have. For example, from
>> your file we can see that the entries are sorted. We can therefore quickly
>> find the candidate rows that have the correct "fm" and then, from these
>> candidate rows, find the ones that are within the window. Something like
>> this: https://gist.github.com/KristofferC/207dcc6f4f22eed73562. I haven't
>> tested this carefully, so there might be some off-by-one errors, but it
>> should do the same amount of work.
>>
>> This one takes:
>> elapsed time: 0.220236023 seconds (65040756 bytes allocated, 22.00% gc time)
>>
>> I don't know if you can tell DataFramesMeta that columns are sorted; if
>> so, it can probably be sped up.
>>
>> On Sunday, August 23, 2015 at 1:01:28 AM UTC+2, Michael Wang wrote:
>>>
>>> I am new to Julia. I heard that Julia has C++-like performance even
>>> though it is a high-level language. I tested an example on my machine;
>>> however, the result was that Julia was in the same ballpark as R, not
>>> C++. Here is my code.
>>> R:
>>>
>>> ptm <- proc.time()
>>>
>>> DPY <- 252     ## days per year
>>> NWINDOW <- 126 ## can be smaller or larger than 252
>>>
>>> ds <- read.csv("xri.csv") ## a sample data set
>>>
>>> ## PS: this is much faster than assigning to a data frame in a loop
>>> b.ols <- sd.ols <- rep(NA, nrow(ds))
>>>
>>> for (i in 1:nrow(ds)) {
>>>   thisday <- ds$day[i]
>>>   if (thisday %% DPY != 0) next ## calculate only on year end
>>>   if (thisday < DPY) next       ## start only without NA
>>>   thisfm <- ds$fm[i]
>>>   datasubset <- subset(ds, (ds$fm==thisfm) & (ds$day>=(thisday-NWINDOW)) & (ds$day<=(thisday-1)))
>>>   olsreg <- lm(xr ~ xm, data = datasubset)
>>>   b.ols[i] <- coef(olsreg)[2]
>>>   sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
>>>   cat(i, " ") ## ping me to see we are not dead for large data sets
>>> }
>>>
>>> ds$b.ols <- b.ols
>>> ds$sd.ols <- sd.ols
>>>
>>> cat("\nOLS Beta Regressions are Done\n")
>>>
>>> ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
>>> ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
>>>
>>> cat("Cross-Sectional OLS Statistics are Done\n")
>>>
>>> ds <- within(ds, {
>>>   w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
>>>   b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean, 4)
>>>   b.ols <- round(b.ols, 4)
>>> })
>>>
>>> cat("OLS and VCK are Done. Now Writing Output.\n")
>>>
>>> proc.time() - ptm
>>>
>>> The running time is around 30 seconds for R.
>>>
>>> Julia:
>>>
>>> using DataFrames
>>> using DataFramesMeta
>>> using GLM
>>>
>>> tic()
>>> DPY = 252     ## days per year
>>> NWINDOW = 126 ## can be smaller or larger than 252
>>>
>>> ds = readtable("xri.csv") ## a sample data set
>>>
>>> ## create two empty arrays to store the b_ols and sd_ols values
>>> b_ols = DataArray(Float64, size(ds, 1))
>>> sd_ols = DataArray(Float64, size(ds, 1))
>>>
>>> for i = 1:size(ds, 1)
>>>   thisDay = ds[i, :day] ## Julia DataFrame way of accessing data; in R: ds$day[i]
>>>   if mod(thisDay, DPY) != 0 ## calculate only on year end
>>>     continue
>>>   end
>>>   if thisDay < DPY ## start only without NA
>>>     continue
>>>   end
>>>   thisFm = ds[i, :fm]
>>>   dataSubset = @where(ds, (:fm .== thisFm) & (:day .>= (thisDay - NWINDOW)) & (:day .<= (thisDay - 1)))
>>>   olsReg = fit(LinearModel, xr ~ xm, dataSubset) ## OLS from package GLM
>>>   b_ols[i] = coef(olsReg)[2]    ## returns the OLS coefficients
>>>   sd_ols[i] = stderr(olsReg)[2] ## returns the coefficients' standard errors
>>>   print(i, " ")
>>> end
>>>
>>> ds[:b_ols] = b_ols
>>> ds[:sd_ols] = sd_ols
>>>
>>> print("\nOLS Beta Regressions are Done\n")
>>>
>>> ds = join(ds, by(ds, :day) do d
>>>   DataFrame(xsect_mean = mean(dropna(d[:b_ols])), xsect_sd = std(dropna(d[:b_ols])))
>>> end, on = [:day], kind = :inner)
>>> sort!(ds)
>>>
>>> print("Cross-Sectional OLS Statistics are Done\n")
>>>
>>> ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
>>> ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
>>> ds[:b_ols] = @with(ds, round(:b_ols, 4))
>>>
>>> print("OLS and VCK are Done. Now Writing Output.\n")
>>>
>>> toc()
>>>
>>> The running time is around 15 seconds for Julia.
>>>
>>> I tried C++, too. With the same output as R and Julia, C++ took only 0.23
>>> seconds. Can someone tell me why this is happening?
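
Here are the sketches promised in points 2) through 4) above. I haven't run any of them, so treat them as illustrations of the patterns rather than drop-in replacements; they use the same DataFrames syntax as the code in the thread, and every helper name in them is made up.

First, the function barrier from point 2). The column :x and both function names are invented; the same trick applies to the regression loop above.

using DataFrames

## Slow: the compiler cannot infer the element type of ds[i, :x] at this
## call site, so every iteration goes through dynamic dispatch.
function sum_naive(ds)
    s = 0.0
    for i in 1:size(ds, 1)
        s += ds[i, :x]
    end
    s
end

## Fast: extract the column once and do the work inside a function that
## takes it as an argument; there, the column's concrete type is known and
## the loop compiles to specialized code.
function sum_column(x)
    s = 0.0
    for v in x
        s += v
    end
    s
end
sum_barrier(ds) = sum_column(ds[:x])

ds = DataFrame(x = rand(10^6))
@time sum_naive(ds)    ## run each twice to exclude JIT compilation
@time sum_barrier(ds)  ## roughly the speed of a loop over a plain vector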
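
Second, an in-place version of one of the @with lines, per point 3). Each broadcasted operator in

ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))

allocates a fresh array; a single loop over a preallocated output avoids those temporaries and gets the function-barrier benefit for free. `weights!` is a name I made up, and I handle NA rows explicitly rather than relying on NA propagation.

## Fill w in place with xsect_sd^2 / (sd_ols^2 + xsect_sd^2), keeping NA
## where either input is NA (the rows where no regression was run).
function weights!(w, sd_ols, xsect_sd)
    for i in 1:length(w)
        if isna(sd_ols[i]) || isna(xsect_sd[i])
            w[i] = NA
        else
            s2 = xsect_sd[i]^2
            w[i] = s2 / (sd_ols[i]^2 + s2)
        end
    end
    w
end

w_ols = DataArray(Float64, size(ds, 1))      ## preallocate once
weights!(w_ols, ds[:sd_ols], ds[:xsect_sd])  ## fill in place
ds[:w_ols] = w_ols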
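
Third, point 4): replacing the anonymous do-block passed to `by` with a named generic function, which anonymous-function overhead currently makes worthwhile. As a side benefit, this computes dropna once per group instead of twice. `daystats` is a made-up name; the body is otherwise the code from the thread.

## One row of cross-sectional statistics per day group.
function daystats(d)
    b = dropna(d[:b_ols])
    DataFrame(xsect_mean = mean(b), xsect_sd = std(b))
end

ds = join(ds, by(ds, :day, daystats), on = [:day], kind = :inner)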