Independent of Jacob's (excellent) work, I've begun wrapping the SFrames library (https://github.com/dato-code/SFrame), which is used internally by Graphlab Create (https://dato.com/products/create/). One of its features is a fast and robust CSV reader. I think I'll have something to preview in the next week or so.
On Wednesday, October 7, 2015 at 2:33:41 AM UTC-4, bernhard wrote:
>
> Is there any update on this? Or maybe a timeline/roadmap?
> I would love to see a faster CSV reader.
>
> I tried to take a look at Jacob's CSV.jl, but I seem to be missing
> https://github.com/lindahua/DataStreams.jl, and I have no idea where to
> find the DataStreams package. Does it still exist?
>
> Is there any (experimental) way to make CSV.jl work?
>
>> On Saturday, June 6, 2015 at 14:41:36 UTC+2, David Gold wrote:
>>
>> @Jacob,
>>
>> Thank you very much for your explanation! I expect having such a
>> blueprint will make delving into the actual code more tractable for me.
>> I'll be curious to see how your solution here and your proposal for
>> string handling end up playing with the current Julia data ecosystem.
>>
>> On Saturday, June 6, 2015 at 1:17:34 AM UTC-4, Jacob Quinn wrote:
>>>
>>> @David,
>>>
>>> Sorry for the slow response. It's been a busy week :)
>>>
>>> Here's a quick rundown of the approach:
>>>
>>> - In the still-yet-to-be-officially-published
>>>   https://github.com/quinnj/CSV.jl package, the bulk of the code goes
>>>   into creating a `CSV.File` type, in which the structure/metadata of
>>>   the file (e.g. header, delimiter, newline, number of columns,
>>>   detected column types, etc.) is detected and saved.
>>> - `SQLite.create` and now `CSV.read` both take a `CSV.File` as input
>>>   and follow a similar parsing process:
>>>   - The actual file contents are mmapped, i.e. the entire file is
>>>     mapped into memory at once.
>>>   - There are currently three `readfield` methods (Int, Float64,
>>>     String) that take an open `CSV.Stream` type (which holds the
>>>     mmapped data and the current parsing "position") and read a single
>>>     field according to the declared type of that column.
>>>   - For example, `readfield(io::CSV.Stream, ::Type{Float64}, row,
>>>     col)` starts reading at the current position of the `CSV.Stream`
>>>     until it hits the next delimiter, newline, or end of the file,
>>>     then interprets the contents as a Float64, returning `val, isnull`.
>>>
>>> That's pretty much it. One of the most critical performance keys for
>>> both SQLite and CSV.read is not copying strings once the file has been
>>> mmapped. For SQLite, the sqlite3_bind_text library function takes a
>>> flag indicating whether the text should be copied or not, so we're
>>> able to pass a pointer directly to the position in the mmapped array.
>>> For the CSV.read method, which returns a Vector of the columns (as
>>> typed arrays), I've rolled a quick and dirty CString type that looks
>>> like
>>>
>>> immutable CString
>>>     ptr::Ptr{UInt8}
>>>     len::Int
>>> end
>>>
>>> With a few extra method definitions, this type looks very close to a
>>> real string type, but we can construct it by pointing directly into
>>> the mmapped region (which currently isn't possible for native Julia
>>> string types). See https://github.com/quinnj/Strings.jl for more
>>> brainstorming around this alternative string implementation. You can
>>> convert a CString to a Julia string by calling string(x::CString), or
>>> map(string, column) for an Array of CSV.CStrings.
>>>
>>> As an update on performance with the Facebook Kaggle competition
>>> bids.csv file:
>>>
>>> - readcsv: 45 seconds, 33% gc time
>>> - CSV.read: 19 seconds, 3% gc time
>>> - SQLite.create: 25 seconds, 3.25% gc time
>>>
>>> Anyway, hopefully I'll get around to cleaning up CSV.jl for an
>>> official release, but it's that last 10-20% that's always the hardest
>>> to finish up :)
>>>
>>> -Jacob
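To make the non-copying string idea above concrete, here's a minimal
sketch in 0.4-era Julia. It is not CSV.jl's actual code: the readfield
below is a simplified stand-in for the package's typed field readers, and
it only returns zero-copy views.

# A minimal sketch of the zero-copy approach, not CSV.jl's actual code.
# Assumes 0.4-era Julia (immutable, bytestring, the Mmap module).

immutable CString
    ptr::Ptr{UInt8}   # points directly into the mmapped buffer
    len::Int          # byte length of the field
end

# Converting to a real Julia string copies the bytes; this is the only
# place an allocation happens.
Base.string(s::CString) = bytestring(s.ptr, s.len)

# Simplified field reader: scan from `pos` to the next delimiter or
# newline and return a zero-copy CString view plus the next position.
function readfield(buf::Vector{UInt8}, pos::Int, delim::UInt8=UInt8(','))
    start = pos
    while pos <= length(buf) && buf[pos] != delim && buf[pos] != UInt8('\n')
        pos += 1
    end
    return CString(pointer(buf, start), pos - start), pos + 1
end

# Usage: mmap the file, then scan fields without copying.
io = open("bids.csv")
buf = Mmap.mmap(io, Vector{UInt8}, filesize(io))
field, pos = readfield(buf, 1)
println(string(field))   # copies only at conversion time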
>>>
>>> On Mon, Jun 1, 2015 at 4:25 PM, David Gold <david....@gmail.com> wrote:
>>>
>>>> @Jacob I'm just developing a working understanding of these issues.
>>>> Would you please help me get a better handle on your solution?
>>>>
>>>> My understanding thus far: reading a (local) .csv file into a
>>>> DataFrame using DataFrames.readtable involves reading the file into
>>>> an IOStream and then parsing that stream into a form amenable to
>>>> DataFrames.builddf, which builds the DataFrame object returned by
>>>> readtable. Getting the contents of the .csv file into memory in a
>>>> form that Julia functions can manipulate is labor-intensive done this
>>>> way. With SQLite, however, the entire file can just be thrown into
>>>> memory wholesale, along with some metadata (maybe not the right
>>>> term?) that delineates the tabular properties of the data.
>>>>
>>>> What I am curious about, then (if this understanding is not too
>>>> misguided), is how SQLite returns, say, a column of data that doesn't
>>>> include a bunch of delimiters. That is, what sort of parsing *does*
>>>> SQLite do, and when?
>>>>
>>>> On Monday, June 1, 2015 at 1:48:16 PM UTC-4, Jacob Quinn wrote:
>>>>>
>>>>> The biggest single advantage SQLite has is the ability to mmap a
>>>>> file and just tell SQLite which pointer addresses start strings and
>>>>> how long they are, all without copying. The huge bottleneck in most
>>>>> implementations is not just identifying where a string starts and
>>>>> how long it is, but then allocating "in program" memory and copying
>>>>> the string into it. With SQLite, we can use an in-memory database,
>>>>> mmap the file, and tell SQLite where each string for a column lives
>>>>> by giving it the starting pointer address and the length. I've been
>>>>> looking into how to solve this problem over the last month or so
>>>>> (apart from Oscar's gc wizardry), and it just occurred to me last
>>>>> week that using SQLite may be the best way; so far, the results are
>>>>> promising!
>>>>>
>>>>> -Jacob
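For readers unfamiliar with the piece of the SQLite C API that makes this
possible: sqlite3_bind_text takes a destructor argument, and passing
SQLITE_STATIC (a null pointer) instead of SQLITE_TRANSIENT tells SQLite
not to copy the text. A hedged sketch of the underlying call, not
SQLite.jl's actual code (the library name "libsqlite3" is
platform-dependent):

# Illustrative only; SQLite.jl wraps this differently. SQLITE_STATIC
# (a null destructor) promises the caller keeps the bytes alive -- here,
# the mmapped file -- so SQLite stores just the pointer and length.
const SQLITE_STATIC = C_NULL  # (void(*)(void*))0 in the C headers

function bind_text_nocopy(stmt::Ptr{Void}, col::Integer,
                          ptr::Ptr{UInt8}, len::Integer)
    # int sqlite3_bind_text(sqlite3_stmt*, int, const char*, int,
    #                       void(*)(void*))
    ccall((:sqlite3_bind_text, "libsqlite3"), Cint,
          (Ptr{Void}, Cint, Ptr{UInt8}, Cint, Ptr{Void}),
          stmt, col, ptr, len, SQLITE_STATIC)
end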
>>>>>
>>>>> On Mon, Jun 1, 2015 at 11:40 AM, <verylu...@gmail.com> wrote:
>>>>>
>>>>>> Great, thank you Jacob, I will try it out!
>>>>>>
>>>>>> Do you have a writeup on the differences between the way you read
>>>>>> CSV files and the way it is currently done in Julia? Would love to
>>>>>> know more!
>>>>>>
>>>>>> Obvious perhaps, but for completeness: reading the data using
>>>>>> readcsv or readdlm does not much improve the metrics I reported,
>>>>>> suggesting that the overhead from DataFrames itself is small.
>>>>>>
>>>>>> Thank you again!
>>>>>>
>>>>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>>>>
>>>>>>> I've been meaning to clean some things up and properly release the
>>>>>>> functionality, but I have a new way to read in CSV files that
>>>>>>> beats anything else out there that I know of. To get the
>>>>>>> functionality, you'll need to be running 0.4 master, then do:
>>>>>>>
>>>>>>> Pkg.add("SQLite")
>>>>>>> Pkg.checkout("SQLite", "jq/updates")
>>>>>>> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl")
>>>>>>>
>>>>>>> I then ran the following on the bids.csv file:
>>>>>>>
>>>>>>> using SQLite, CSV
>>>>>>>
>>>>>>> db = SQLite.SQLiteDB()
>>>>>>>
>>>>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>>>>
>>>>>>> @time lines = SQLite.create(db, ff, "temp2")
>>>>>>>
>>>>>>> It took 18 seconds on my newish MBP. From the R data.table
>>>>>>> package, `fread` is the fastest other CSV reader I know of, and it
>>>>>>> took 34 seconds on my machine. I'm actually pretty surprised by
>>>>>>> that, since in other tests I've done it was on par with SQLite+CSV
>>>>>>> or sometimes slightly faster.
>>>>>>>
>>>>>>> Now, you're not necessarily getting a Julia structure in this
>>>>>>> case; rather, the data is loaded into an SQLite table, on which
>>>>>>> you can then run SQLite.query(db, sql_string) to do manipulations
>>>>>>> and such.
>>>>>>>
>>>>>>> -Jacob
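Once the data is in the table, manipulations happen in SQL via the
SQLite.query call Jacob mentions. A hedged usage sketch; the bids.csv
column name bidder_id is an assumption taken from the Kaggle data
description, so adjust it to your file's header:

using SQLite, CSV

db = SQLite.SQLiteDB()
SQLite.create(db, CSV.File("bids.csv"), "temp2")

# Top ten most active bidders, computed inside SQLite without pulling
# the full table into Julia. (bidder_id assumed from the Kaggle schema.)
res = SQLite.query(db, "SELECT bidder_id, COUNT(*) AS nbids
                        FROM temp2
                        GROUP BY bidder_id
                        ORDER BY nbids DESC
                        LIMIT 10")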
>>>>>>>
>>>>>>> On Sun, May 31, 2015 at 9:42 PM, <verylu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not
>>>>>>>> mention in my OP that I was using Version 0.3.10-pre+1
>>>>>>>> (2015-05-30 11:26 UTC) Commit 80dd75c* (1 day old release-0.3).
>>>>>>>>
>>>>>>>> I tried other releases as Tim suggested:
>>>>>>>>
>>>>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648*
>>>>>>>> (0 days old master), the same command takes 14 minutes - half
>>>>>>>> what it took with release-0.3, but still 3 times more than R's
>>>>>>>> read.csv (5 min). More important, the Julia process takes up 8GB
>>>>>>>> of memory (the R session takes 1.6GB). The output of the command
>>>>>>>> `@time DataFrames.readtable("bids.csv");` is:
>>>>>>>>
>>>>>>>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time)
>>>>>>>> # gc time reduced from 85% to 71%
>>>>>>>>
>>>>>>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55
>>>>>>>> UTC) ob/gctune/238ed08* (fork: 1 commits, 39 days), the command
>>>>>>>> `@time DataFrames.readtable("bids.csv");` takes 21 minutes; the
>>>>>>>> output of the macro is:
>>>>>>>>
>>>>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58%
>>>>>>>> gc time in 33 pauses with 31 full sweep)
>>>>>>>>
>>>>>>>> The process also takes up 8GB of memory on the machine, more than
>>>>>>>> the earlier one. My machine has also significantly slowed down -
>>>>>>>> so perhaps the increase in memory compared to release-0.3 is
>>>>>>>> significant.
>>>>>>>>
>>>>>>>> On disabling gc, my machine (a 4GB laptop) goes soul searching,
>>>>>>>> so it's not an option for now.
>>>>>>>>
>>>>>>>> Is this the best one can expect for now? I read the discussion on
>>>>>>>> issue #10428 but I did not understand it well :-(
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>>>>
>>>>>>>>> Not ideal, but for now you can try turning off the garbage
>>>>>>>>> collection while reading in the DataFrame:
>>>>>>>>>
>>>>>>>>> gc_disable()
>>>>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>>>>> gc_enable()
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jiahao Chen
>>>>>>>>> Research Scientist
>>>>>>>>> MIT CSAIL
>>>>>>>>>
>>>>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <tim....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> If you're using julia 0.3, you might want to try current master
>>>>>>>>>> and/or possibly the "ob/gctune" branch:
>>>>>>>>>>
>>>>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> --Tim
>>>>>>>>>>
>>>>>>>>>> On Sunday, May 31, 2015 09:50:03 AM verylu...@gmail.com wrote:
>>>>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows
>>>>>>>>>> > and 9 columns (mostly strings):
>>>>>>>>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>>>>> >
>>>>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes, and
>>>>>>>>>> > the resulting dataframe takes 0.6GB (RStudio takes a total of
>>>>>>>>>> > 1.6GB of memory on my machine):
>>>>>>>>>> >
>>>>>>>>>> > > t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>>>>> >    user  system elapsed
>>>>>>>>>> > 332.295   4.154 343.332
>>>>>>>>>> >
>>>>>>>>>> > > object.size(a)
>>>>>>>>>> > 601496056 bytes # (0.6 GB)
>>>>>>>>>> >
>>>>>>>>>> > Loading the same dataset using DataFrames' readtable takes
>>>>>>>>>> > about 30 minutes on the same machine (it varies a bit; the
>>>>>>>>>> > lowest is 25 minutes), and the Julia process (REPL in a
>>>>>>>>>> > terminal) takes 6GB of memory on the same machine.
>>>>>>>>>> >
>>>>>>>>>> > (I added a couple of calls to the @time macro inside the
>>>>>>>>>> > readtable function to see what's taking the time - the
>>>>>>>>>> > outcomes of these calls are below.)
>>>>>>>>>> >
>>>>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>>>>> > WARNING: Begin readnrows call
>>>>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes
>>>>>>>>>> > allocated, 0.35% gc time)
>>>>>>>>>> > WARNING: End readnrows call
>>>>>>>>>> > WARNING: Begin builddf call
>>>>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes
>>>>>>>>>> > allocated, 85.54% gc time)
>>>>>>>>>> > WARNING: End builddf call
>>>>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes
>>>>>>>>>> > allocated, 84.12% gc time) # total time for loading
>>>>>>>>>> >
>>>>>>>>>> > Can you please suggest how I can improve load time and memory
>>>>>>>>>> > usage in DataFrames for sizes this big and bigger?
>>>>>>>>>> >
>>>>>>>>>> > Thank you!
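A final note on Jiahao's workaround upthread: if readtable throws, gc
stays disabled for the rest of the session. A slightly safer sketch wraps
the call in try/finally (gc_disable/gc_enable are the 0.3/0.4-era names):

using DataFrames

# Re-enable gc even if readtable errors; try/finally returns the value
# of the try block, so df is still bound on success.
gc_disable()
df = try
    readtable("bids.csv")
finally
    gc_enable()
end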