I'm traveling and cannot take the time to go through this code more 
thoroughly. However, I see at least a few issues worth mentioning in 
addition to the other comments:

1) Note that Kristoffer's function 'f' in 
https://gist.github.com/KristofferC/207dcc6f4f22eed73562 does not include 
the readtable call, so the cost of reading the csv is not counted in the 
"fast" time. Work is being done on a performant CSV reader, but I believe 
the present implementations in both DataFrames and Base are on a par with 
or slower than R's reader.

2) Unfortunately, naive indexing into DataFrames is not type-stable. That 
is to say, Julia's compiler cannot infer the type of a value returned from 
indexing into a DataFrame, because the column types are not part of the 
DataFrame's own type. However, it can infer this information if you pass 
the relevant columns of said DataFrame into a function and perform the 
relevant operations on the columns within the function (the "function 
barrier" pattern). Thus you may see some further performance improvements 
if you wrap the loop in Kristoffer's "fast" 'f' in another function that 
takes the relevant columns of the DataFrame as arguments; see the first 
sketch after this list.

3) I haven't tested this hypothesis, but I doubt that the macros per se are 
incurring the performance penalty. In fact, the @with macro employs 
precisely the strategy noted above and for this reason is generally faster 
than the naive alternative. I suspect what is really contributing to the 
performance cost in those lines is the extensive use of broadcasted 
operators, since each one allocates a new array. If you can find a way to 
do those operations in place you may save some time there; see the second 
sketch after this list.

4) Naive use of anonymous functions currently can incur a performance 
penalty, which may be affecting the by/join step in your original code 
(the do-block passed to 'by' is an anonymous function); see the third 
sketch after this list.
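
First, a minimal sketch of the function-barrier pattern from point 2. The 
function name and the toy computation are mine, not from Kristoffer's gist; 
the point is only that 'day' has a concrete type inside the function, so 
the loop compiles to tight code:

function count_year_ends(day, DPY)
    ## inside this function 'day' is concrete (e.g. DataArray{Int,1}),
    ## so indexing into it is fast
    n = 0
    for i in 1:length(day)
        if mod(day[i], DPY) == 0 && day[i] >= DPY
            n += 1
        end
    end
    return n
end

count_year_ends(ds[:day], 252)  ## pass the column in; its type is known inside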
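
Second, a sketch of point 3, replacing the broadcasted expression for 
:w_ols with one explicit loop (the helper name is mine and NA handling is 
glossed over, so treat this as untested):

function fill_w_ols!(w_ols, sd_ols, xsect_sd)
    ## one pass; no intermediate arrays from .^, ./ and +
    for i in 1:length(w_ols)
        w_ols[i] = xsect_sd[i]^2 / (sd_ols[i]^2 + xsect_sd[i]^2)
    end
    return w_ols
end

ds[:w_ols] = fill_w_ols!(DataArray(Float64, size(ds, 1)), ds[:sd_ols], ds[:xsect_sd])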
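
Third, a sketch of point 4: hoisting the do-block passed to 'by' into a 
named, top-level function (the name is mine; whether this helps will 
depend on your Julia version, so benchmark it):

function day_stats(sdf)
    b = dropna(sdf[:b_ols])
    DataFrame(xsect_mean = mean(b), xsect_sd = std(b))
end

ds = join(ds, by(ds, :day, day_stats), on = [:day], kind = :inner)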

On Wednesday, August 26, 2015 at 4:43:13 PM UTC-6, Michael Wang wrote:
>
> Thank you Kristoffer. This helps a lot.
>
> On Sunday, August 23, 2015 at 12:46:04 PM UTC-7, Kristoffer Carlsson wrote:
>>
>> What's good with Julia is that if a library or something else is slow, 
>> it is usually easy to tweak it to get your performance.
>>
>> For example, if we only focus on the beta regressions, here is your 
>> current code (slightly modified so that reading the csv is not part of 
>> the benchmark): https://gist.github.com/KristofferC/f424e1d7daf4c6496ad2
>>
>> On my computer it takes this amount of time:
>>
>> elapsed time: 4.925883456 seconds (1697125616 bytes allocated, 31.27% gc time)
>>
>> However, we are not using all the information we have. For example, from 
>> your file we can see that the entries are sorted. We can therefore quickly 
>> find the candidate rows that have the correct "fm" and then from these 
>> candidate rows find the ones that are within the window. Something like 
>> this: https://gist.github.com/KristofferC/207dcc6f4f22eed73562. I 
>> haven't tested this carefully, so there might be some off-by-one errors, 
>> but it should do the same amount of work.
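>>
>> The core of the idea, roughly (an untested sketch, not the exact code 
>> from the gist; it assumes the rows are sorted by fm and then by day, 
>> with no NAs in either column):
>>
>> fm, day = ds[:fm], ds[:day]
>> r = searchsorted(fm, thisFm)  ## binary search: the rows with this fm
>> lo = first(r) + searchsortedfirst(day[r], thisDay - NWINDOW) - 1
>> hi = first(r) + searchsortedlast(day[r], thisDay - 1) - 1
>> dataSubset = ds[lo:hi, :]  ## O(log n) lookups instead of scanning every row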
>>
>> This one takes:
>> elapsed time: 0.220236023 seconds (65040756 bytes allocated, 22.00% gc time)
>>
>> I don't know if you can tell DataFramesMeta that columns are sorted; if 
>> so, it can probably be sped up.
>>
>>
>> On Sunday, August 23, 2015 at 1:01:28 AM UTC+2, Michael Wang wrote:
>>>
>>> I am new to Julia. I heard that Julia has performance comparable to C++ 
>>> even though it is a high-level language. I tested an example on my 
>>> machine; however, the result was that Julia was in the same ballpark as 
>>> R, not C++. Here is my code.
>>> R:
>>> ptm <- proc.time()
>>>
>>> DPY <- 252  ## days per year
>>> NWINDOW <- 126  ## can be smaller or larger than 252
>>>
>>> ds <- read.csv("xri.csv")  ## a sample data set
>>>
>>> ## PS: this is much faster than assigning to a data frame in a loop
>>> b.ols <- sd.ols <- rep(NA, nrow(ds))
>>>
>>> for (i in 1:nrow(ds)) {
>>>     thisday <- ds$day[i]
>>>     if (thisday %% DPY != 0) next  ## calculate only on year end
>>>     if (thisday < DPY) next  ## start only without NA
>>>     thisfm <- ds$fm[i]
>>>     datasubset <- subset( ds, (ds$fm==thisfm) & 
>>> (ds$day>=(thisday-NWINDOW)) & (ds$day<=(thisday-1)) )
>>>     olsreg <- lm(xr ~ xm, data = datasubset)
>>>     b.ols[i] <- coef(olsreg)[2]
>>>     sd.ols[i] <- sqrt(vcov(olsreg)[2, 2])
>>>     cat(i, " ")  ## ping me to see we are not dead for large data sets
>>> }
>>>
>>> ds$b.ols <- b.ols
>>> ds$sd.ols <- sd.ols
>>>
>>> cat("\nOLS Beta Regressions are Done\n")
>>>
>>> ds$xsect.sd <- ave(ds$b.ols, ds$day, FUN=function(x) sd(x, na.rm=T))
>>> ds$xsect.mean <- ave(ds$b.ols, ds$day, FUN=function(x) mean(x, na.rm=T))
>>>
>>> cat("Cross-Sectional OLS Statistics are Done\n")
>>>
>>> ds <- within(ds, {
>>>                  w.ols <- xsect.sd^2/(sd.ols^2+xsect.sd^2)
>>>                  b.vck <- round(w.ols*b.ols + (1-w.ols)*xsect.mean,4)
>>>                  b.ols <- round(b.ols,4)
>>>              })
>>>
>>> cat("OLS and VCK are Done.  Now Writing Output.\n")
>>>
>>> proc.time() - ptm
>>>
>>>
>>> The running time is around 30 seconds for R.
>>>
>>> Julia:
>>> using DataFrames
>>> using DataFramesMeta
>>> using GLM
>>>
>>> tic()
>>> DPY = 252  ## days per year
>>> NWINDOW = 126  ## can be smaller or larger than 252
>>>
>>> ds = readtable("xri.csv")  ## a sample data set
>>>
>>> # create two empty arrays to store the b_ols and sd_ols values
>>> b_ols = DataArray(Float64, size(ds)[1])
>>> sd_ols = DataArray(Float64, size(ds)[1])
>>>
>>> for i = 1:size(ds)[1]
>>>     thisDay = ds[i, :day]  ## Julia DataFrame way of accessing data; in R: ds$day[i]
>>>     if mod(thisDay, DPY) != 0
>>>         continue
>>>     end
>>>     if thisDay < DPY
>>>         continue
>>>     end
>>>     thisFm = ds[i, :fm]
>>>     dataSubset = @where(ds, (:fm .== thisFm) & (:day .>= (thisDay - NWINDOW)) & (:day .<= (thisDay - 1)))
>>>     olsReg = fit(LinearModel, xr ~ xm, dataSubset)  ## OLS from package GLM
>>>     b_ols[i] = coef(olsReg)[2]  ## returns the OLS coefficients
>>>     sd_ols[i] = stderr(olsReg)[2]  ## returns the OLS coefficients' standard error
>>>     print(i, " ")
>>> end
>>>
>>> ds[:b_ols] = b_ols
>>> ds[:sd_ols] = sd_ols
>>>
>>> print("\nOLS Beta Regressions are Done\n")
>>>
>>> ds = join(ds, by(ds, :day) do ds
>>>     DataFrame(xsect_mean = mean(dropna(ds[:b_ols])),
>>>               xsect_sd = std(dropna(ds[:b_ols])))
>>> end, on = [:day], kind = :inner)
>>> ds = sort!(ds)
>>>
>>> print("Cross-Sectional OLS Statistics are Done\n")
>>>
>>> ds[:w_ols] = @with(ds, :xsect_sd.^2 ./ (:sd_ols.^2 + :xsect_sd.^2))
>>> ds[:b_vck] = @with(ds, round(:w_ols .* :b_ols + (1 - :w_ols) .* :xsect_mean, 4))
>>> ds[:b_ols] = @with(ds, round(:b_ols, 4))
>>>
>>> print("OLS and VCK are Done.  Now Writing Output.\n")
>>>
>>> toc()
>>>
>>> The running time is around 15 seconds for Julia.
>>>
>>> I tried C++, too. Producing the same output as R and Julia, C++ took 
>>> only 0.23 seconds. Can someone tell me why this is happening?
>>>
