Re: [julia-users] DataFrames' readtable very slow compared to R's read.csv when loading ~7.6M csv rows

2015-06-01 Thread veryluckyxyz
Great, thank you Jacob, I will try it out! Do you have a writeup on the differences between the way you read CSV files and the way it is currently done in Julia? Would love to know more! Obvious perhaps, but for completeness: reading the data using readcsv or readdlm does not much improve the metrics I

Re: [julia-users] DataFrames' readtable very slow compared to R's read.csv when loading ~7.6M csv rows

2015-05-31 Thread veryluckyxyz
Thank you Tim and Jiahao for your responses. Sorry, I did not mention in my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) Commit 80dd75c* (1 day old release-0.3). I tried other releases as Tim suggested: On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 days

[julia-users] DataFrames' readtable very slow compared to R's read.csv when loading ~7.6M csv rows

2015-05-31 Thread veryluckyxyz
Facebook's Kaggle competition has a dataset with ~7.6e6 rows with 9 columns (mostly strings). https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data Loading the dataset in R using read.csv takes 5 minutes and the resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB memory

Re: [julia-users] Re: Confidence for regression

2015-04-01 Thread veryluckyxyz
Ah OK, sorry for misunderstanding the question. Yes, there are no methods yet to compute intervals for predicted values. Sorry again!
On Wednesday, April 1, 2015 at 2:06:07 PM UTC-4, tshort wrote:
> I think the question was for prediction intervals. I don't see that in GLM, yet.
>
> On Wed,

[julia-users] Re: Confidence for regression

2015-04-01 Thread veryluckyxyz
GLM.jl has prediction methods for Linear and Generalized Linear Models. They take the corresponding model and the features as input. For implementation details, please see [glm] https://github.com/JuliaStats/GLM.jl/blob/a7fb0057a7bc835d819e842c6f42f14601840a1b/src/glmfit.jl#L249 and [lm] https://githu

[julia-users] Re: Efficient way to split an array/dataframe?

2015-03-26 Thread veryluckyxyz
Thank you Gunner, Tim, and James! These are great solutions and many times faster than my implementation.
On Thursday, March 26, 2015 at 10:17:10 AM UTC-4, James Fairbanks wrote:
> Since you mentioned a test set and a training set. You might want to use MLBase.jl which has reusable tools fo

[julia-users] Efficient way to split an array/dataframe?

2015-03-25 Thread veryluckyxyz
Hi, I have an array of 100 elements. I want to split the array randomly into 70 elements (test set) and 30 elements (train set).

    N = 100
    A = rand(N);
    n = convert(Int, ceil(N*0.7))
    testindex = sample(1:size(A,1), n, replace=false)
    testA = A[testindex];

How can I get the train set? I could loop through testA and A to g
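One way to get both sets without looping is to shuffle the indices once and cut the permutation in two. The following is only a minimal sketch, not one of the solutions posted in the thread: it assumes a current Julia 1.x release (where `randperm` lives in the `Random` standard library), and `split_indices` is a hypothetical helper name introduced here for illustration. The 70/30 naming follows the question (70 elements in the test set, 30 in the train set).

```julia
using Random  # standard library; provides randperm (assumes Julia 1.x)

# Hypothetical helper (not from the thread): shuffle 1:N once, then cut it.
# Returns two disjoint index vectors that together cover 1:N.
function split_indices(N::Integer, frac::Real)
    n = ceil(Int, N * frac)          # size of the first part
    perm = randperm(N)               # random permutation of 1:N
    return perm[1:n], perm[n+1:end]
end

N = 100
A = rand(N)

testindex, trainindex = split_indices(N, 0.7)
testA  = A[testindex]    # first part, as in the question
trainA = A[trainindex]   # everything that was not selected
```

If the indices have already been drawn with `sample`, the complement can likely be obtained with `setdiff(1:N, testindex)` instead, at the cost of a second pass over the index range.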

Re: [julia-users] Symbol to string

2015-03-24 Thread veryluckyxyz
`convert` does not seem to work with symbols. I get an error (same for String and UTF8String as well). Test commands:

    testdf = DataFrame(A = [1,2,3], B=[2,3,4])
    string(names(testdf)[1]) # works
    convert(String, names(testdf)[1]) # throws error
    convert(ASCIIString, names(testdf)[1]) # throws error
    c
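The behaviour described above can be reproduced without DataFrames by using a bare `Symbol`. This is a small sketch that assumes a current Julia 1.x release, where `ASCIIString` and `UTF8String` no longer exist and `String` is the concrete string type; the variable `sym` is a stand-in for a column name and is not taken from the post.

```julia
sym = :A                 # stand-in for a Symbol column name

s1 = string(sym)         # generic conversion to a String
s2 = String(sym)         # the String constructor also accepts a Symbol (Julia 1.x)
back = Symbol(s1)        # and back again to a Symbol

# convert has no method for Symbol -> String, so it throws a MethodError.
convert_failed = try
    convert(String, sym)
    false
catch err
    err isa MethodError
end
```

`convert` is intended for lossless conversions that are safe to apply implicitly, which is presumably why no `Symbol` to `String` method is defined; calling `string` or the `String` constructor explicitly is the usual route.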