On 2/17/12 7:44 PM, bearophile wrote:
A tiny file-line-reading benchmark I've just found on Reddit:
http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/

http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html

The Ruby code that slowly generates the test data:
https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
But for my timings I have used only about 40% of that file, the first
1_965_800 lines, because I have less memory.
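
A minimal D sketch of trimming the generated file to its first 1_965_800 lines (the file names are assumptions; `head -n 1965800` from a shell does the same job):

import std.stdio;

void main() {
    auto fin = File("flat_file.tsv");             // assumed name of the generated file
    auto fout = File("flat_file_40pct.tsv", "w"); // assumed name for the trimmed copy
    size_t n = 0;
    foreach (line; fin.byLine(KeepTerminator.yes)) {
        fout.write(line);                         // terminator kept, so write as-is
        if (++n == 1_965_800) break;
    }
}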

My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds 
(the D version runs in 13.20 seconds if I don't disable the GC).
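
A minimal sketch of switching off the GC for the duration of such a run, via the standard core.memory hook (whether this is exactly what was done here is an assumption):

import core.memory;

void main() {
    GC.disable();            // suspend collection cycles; allocation still goes through the GC
    scope(exit) GC.enable();
    // ... benchmark body goes here ...
}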

From many other benchmarks I've seen that reading a file line-by-line is slow
in D.

-------------------------
My D code:
[snip]

The thread in D.announce prompted me to go back to this, and I've run a simple test that isolates file reads from everything else. After generating the CSV data as described above, I ran this Python code:

import sys

rows = []
f = open(sys.argv[1])
for line in f:
    # the threshold is never met (see below), so this exercises pure line iteration
    if len(line) > 10000: rows.append(line[:-1].split("\t"))

and this D code:

import std.stdio, std.string, std.array;

void main(string[] args) {
    auto rows = appender!(string[][])();
    auto f = File(args[1]);  // opened in read mode by default
    foreach (line; f.byLine()) {
        // byLine reuses its internal buffer, hence the idup before splitting
        if (line.length > 10000) rows.put(line.idup.split("\t"));
    }
}

Both programs end up appending nothing, because 10000 exceeds every line's length; the loops therefore measure pure line-by-line reading.

On my machine (Mac OS X Lion), the Python code clocks in at around 1.2 seconds and the D code at a whopping 9.3 seconds.

I looked into where the problem lies, and sure enough the issue was a slow loop in the generic I/O implementation of readln. The commit https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44b53015eb1113854424d0 brings the speed of the test to 2.1 seconds. We could and should reduce that further by taking buffering into our own hands, but for now this is good low-hanging fruit to pick.
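
For illustration, here is a minimal sketch of what taking buffering into our own hands could look like: read the file in large chunks with rawRead and split lines by hand. This is not what the commit above does; the chunk size, names, and line handling below are all assumptions.

import std.stdio, std.string;

void main(string[] args) {
    auto f = File(args[1]);            // opened in binary read mode by default
    auto buf = new char[1 << 16];      // 64 KB read buffer; size is arbitrary
    char[] carry;                      // incomplete line carried over from the previous chunk
    size_t nLines = 0;

    for (;;) {
        auto chunk = f.rawRead(buf);
        if (chunk.length == 0) break;  // EOF
        auto data = carry ~ chunk;     // a production version would avoid this copy
        size_t start = 0;
        ptrdiff_t nl;
        while ((nl = indexOf(data[start .. $], '\n')) != -1) {
            auto line = data[start .. start + nl];
            ++nLines;                  // process `line` here instead of just counting
            start += nl + 1;
        }
        carry = data[start .. $].dup;  // stash the tail for the next chunk
    }
    if (carry.length) ++nLines;        // last line without a trailing newline
    writeln(nLines, " lines");
}

The point is that one large fread per 64 KB chunk replaces the per-character work readln was doing, at the cost of handling line boundaries ourselves.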


Andrei
