Even after letting it run for more than 2 hours, it didn't finish, so I killed the process. I'm now compiling Julia 0.4 on a Google Cloud instance, which has more memory than my local machine, and I will try it there.
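For a rough sense of how much memory the parsed table itself should need (as opposed to the cumulative allocations that @time reports), here is a back-of-envelope sketch in Python. The row and column counts come from the CSV.Source schema quoted below. The 8 bytes per cell is an assumption that holds for Int64 columns only, so string columns and null masks would add to it.

```python
# Back-of-envelope size of the parsed table.
# ROWS and COLS are the dimensions shown in the quoted CSV.Source schema.
ROWS = 145_231
COLS = 1_934
BYTES_PER_CELL = 8  # assumption: every cell is stored as a 64-bit value


def dense_table_bytes(rows: int, cols: int, bytes_per_cell: int = BYTES_PER_CELL) -> int:
    """Bytes needed for rows x cols fixed-width cells, ignoring all overhead."""
    return rows * cols * bytes_per_cell


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


table = dense_table_bytes(ROWS, COLS)
print(f"cells: {ROWS * COLS:,}")
print(f"dense 8-byte table: {to_gib(table):.2f} GiB")
```

Under that assumption the table comes to roughly 2 GiB, the same order as the ~2.4gb pandas figure in the quoted thread and well below the 12.775 GB total reported by @time.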
On Wednesday, October 14, 2015 at 10:24:16 AM UTC+5:30, Grey Marsh wrote:
>
> I am using Julia 0.4 for this purpose, if that's what is meant by "0.4 only".
>
> On Wednesday, October 14, 2015 at 9:53:09 AM UTC+5:30, Jacob Quinn wrote:
>>
>> Oh yes, I forgot to mention that the CSV/DataStreams code is 0.4 only.
>> Definitely interested to hear about any results/experiences though.
>>
>> -Jacob
>>
>> On Tue, Oct 13, 2015 at 10:11 PM, Yichao Yu <yyc...@gmail.com> wrote:
>>
>>> On Wed, Oct 14, 2015 at 12:02 AM, Grey Marsh <kd.k...@gmail.com> wrote:
>>> > @Jacob, I tried your approach. Somehow it got stuck in the "@time ds =
>>> > DataStreams.DataTable(f)" line. After 15 minutes running, julia is using
>>> > ~500mb and 1 cpu core with no sign of end. The memory use has been almost
>>> > the same for the whole duration of 15 minutes. I'm letting it run, hoping
>>> > that it finishes after some time.
>>> >
>>> > From your run, I can see it needs 12gb memory, which is higher than my
>>> > machine memory of 8gb. Could it be the problem?
>>>
>>> 12GB is the total amount of memory ever allocated during the timing. A
>>> lot of it might be intermediate results that are freed by the GC.
>>> Also, from the output of @time, it looks like 0.4.
>>>
>>> >
>>> > On Wednesday, October 14, 2015 at 2:28:09 AM UTC+5:30, Jacob Quinn wrote:
>>> >>
>>> >> I'm hesitant to suggest, but if you're in a bind, I have an experimental
>>> >> package for fast CSV reading. The API has stabilized somewhat over the
>>> >> last week and I'm planning a more broad release soon, but I'd still
>>> >> consider it alpha mode. That said, if anyone's willing to give it a
>>> >> drive, you just need to
>>> >>
>>> >> Pkg.add("Libz")
>>> >> Pkg.add("NullableArrays")
>>> >> Pkg.clone("https://github.com/quinnj/DataStreams.jl")
>>> >> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>> >>
>>> >> With the original file referenced here I get:
>>> >>
>>> >> julia> reload("CSV")
>>> >>
>>> >> julia> f = CSV.Source("/Users/jacobquinn/Downloads/train.csv";null="NA")
>>> >> CSV.Source: "/Users/jacobquinn/Downloads/train.csv"
>>> >> delim: ','
>>> >> quotechar: '"'
>>> >> escapechar: '\\'
>>> >> null: "NA"
>>> >> schema:
>>> >> DataStreams.Schema(UTF8String["ID","VAR_0001","VAR_0002","VAR_0003","VAR_0004","VAR_0005","VAR_0006","VAR_0007","VAR_0008","VAR_0009"
>>> >> …
>>> >> "VAR_1926","VAR_1927","VAR_1928","VAR_1929","VAR_1930","VAR_1931","VAR_1932","VAR_1933","VAR_1934","target"],[Int64,DataStreams.PointerString,Int64,Int64,Int64,DataStreams.PointerString,Int64,Int64,DataStreams.PointerString,DataStreams.PointerString
>>> >> …
>>> >> Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,DataStreams.PointerString,Int64],145231,1934)
>>> >> dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
>>> >>
>>> >>
>>> >> julia> @time ds = DataStreams.DataTable(f)
>>> >>  43.513800 seconds (694.00 M allocations: 12.775 GB, 2.55% gc time)
>>> >>
>>> >>
>>> >> You can convert the result to a DataFrame with:
>>> >>
>>> >> function DataFrames.DataFrame(dt::DataStreams.DataTable)
>>> >>     cols = dt.schema.cols
>>> >>     data = Array(Any,cols)
>>> >>     types = DataStreams.types(dt)
>>> >>     for i = 1:cols
>>> >>         data[i] = DataStreams.column(dt,i,types[i])
>>> >>     end
>>> >>     return DataFrame(data,Symbol[symbol(x) for x in dt.schema.header])
>>> >> end
>>> >>
>>> >>
>>> >> -Jacob
>>> >>
>>> >> On Tue, Oct 13, 2015 at 2:40 PM, feza <moham...@gmail.com> wrote:
>>> >>>
>>> >>> Finally was able to load it, but the process consumes a ton of memory.
>>> >>>
>>> >>> julia> @time train = readtable("./test.csv");
>>> >>> 124.575362 seconds (376.11 M allocations: 13.438 GB, 10.77% gc time)
>>> >>>
>>> >>>
>>> >>> On Tuesday, October 13, 2015 at 4:34:05 PM UTC-4, feza wrote:
>>> >>>>
>>> >>>> Same here on a 12gb ram machine
>>> >>>>
>>> >>>>                _
>>> >>>>    _       _ _(_)_     |  A fresh approach to technical computing
>>> >>>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>> >>>>    _ _   _| |_  __ _   |  Type "?help" for help.
>>> >>>>   | | | | | | |/ _` |  |
>>> >>>>   | | |_| | | | (_| |  |  Version 0.5.0-dev+429 (2015-09-29 09:47 UTC)
>>> >>>>  _/ |\__'_|_|_|\__'_|  |  Commit f71e449 (14 days old master)
>>> >>>> |__/                   |  x86_64-w64-mingw32
>>> >>>>
>>> >>>> julia> using DataFrames
>>> >>>>
>>> >>>> julia> train = readtable("./test.csv");
>>> >>>> ERROR: OutOfMemoryError()
>>> >>>>  in resize! at array.jl:452
>>> >>>>  in readnrows! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:164
>>> >>>>  in readtable! at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:767
>>> >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:847
>>> >>>>  in readtable at C:\Users\Mustafa\.julia\v0.5\DataFrames\src\dataframe\io.jl:893
>>> >>>>
>>> >>>>
>>> >>>> On Tuesday, October 13, 2015 at 3:47:58 PM UTC-4, Yichao Yu wrote:
>>> >>>>>
>>> >>>>> On Oct 13, 2015 2:47 PM, "Grey Marsh" <kd.k...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Which julia version are you using? There's some GC tweak on 0.4 for
>>> >>>>> that.
>>> >>>>>
>>> >>>>> >
>>> >>>>> > I was trying to load the training dataset from springleaf marketing
>>> >>>>> > response on Kaggle. The csv is 921 mb, has 145321 rows and 1934
>>> >>>>> > columns. My machine has 8 gb ram and julia ate 5.8gb+ of memory,
>>> >>>>> > after which I stopped julia as there was barely any memory left for
>>> >>>>> > the OS to function properly. The incomplete operation had taken
>>> >>>>> > about 5-6 minutes. I've windows 8 64bit.
>>> >>>>> > Used the following code to read the csv to Julia:
>>> >>>>> >
>>> >>>>> > using DataFrames
>>> >>>>> > train = readtable("C:\\train.csv")
>>> >>>>> >
>>> >>>>> > Next I tried to load the same file in python:
>>> >>>>> >
>>> >>>>> > import pandas as pd
>>> >>>>> > train = pd.read_csv("C:\\train.csv")
>>> >>>>> >
>>> >>>>> > This took ~2.4gb memory, about a minute time
>>> >>>>> >
>>> >>>>> > Checking the same in R again:
>>> >>>>> > df = read.csv('E:/Libraries/train.csv', as.is = T)
>>> >>>>> >
>>> >>>>> > This took 2-3 minutes and consumes 3.5gb mem on the same machine.
>>> >>>>> >
>>> >>>>> > Why such discrepancy and why Julia even fails to load the csv before
>>> >>>>> > running out of memory? Is there any better way to get the file
>>> >>>>> > loaded in Julia?
>>> >>>>> >
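
On the original question at the bottom of the thread (whether there is a better way to get the file loaded): one low-memory option, whatever the final loader, is a streaming pass that holds only one row at a time. The sketch below uses only Python's standard csv module. The function name infer_column_kinds and the three-way int/float/str classification are hypothetical choices for illustration, not what DataFrames, CSV.jl or pandas do internally. The "NA" null marker matches the null="NA" setting in the quoted CSV.Source call.

```python
import csv
import io
from typing import Dict, Iterable, List


def infer_column_kinds(lines: Iterable[str], null: str = "NA") -> Dict[str, str]:
    """Stream a CSV once and classify each column as 'int', 'float' or 'str'.

    Only the current row is held in memory, so peak memory stays small
    no matter how large the file is.
    """
    reader = csv.reader(lines)
    header = next(reader)
    # Start every column at the narrowest kind and widen only when needed.
    kinds: List[str] = ["int"] * len(header)
    for row in reader:
        # Ignore any extra trailing cells beyond the header width.
        for i, cell in enumerate(row[: len(header)]):
            if cell == null or cell == "" or kinds[i] == "str":
                continue  # nulls carry no type information; 'str' is final
            if kinds[i] == "int":
                try:
                    int(cell)
                    continue
                except ValueError:
                    kinds[i] = "float"  # not an int, try the wider kind
            try:
                float(cell)
            except ValueError:
                kinds[i] = "str"
    return dict(zip(header, kinds))


# Tiny in-memory sample; for a real file, pass open(path, newline="") instead.
sample = io.StringIO("ID,VAR_0001,VAR_0002\n1,H,224\n2,NA,7.5\n")
print(infer_column_kinds(sample))
# prints {'ID': 'int', 'VAR_0001': 'str', 'VAR_0002': 'float'}
```

A result like this could then be used to choose column types up front (for example via the dtype argument of pd.read_csv) so that the loader does not have to guess and re-allocate while parsing.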