A small benchmark of reading the lines of a flat file into memory that I've just found on Reddit:
http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/
http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html

The Ruby code that (slowly) generates the test data:
https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb

But for my timings I have used only about 40% of that file, the first 1_965_800 lines, because I have less memory.

My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds (the D version runs in 13.20 seconds if I don't disable the GC). From many other benchmarks I've seen that reading a file line-by-line is slow in D.

-------------------------

My D code:


import core.memory, std.stdio, std.string, std.array;

void main(in string[] args) {
    GC.disable(); // 4.65 s with this line, 13.20 s without it
    auto rows = appender!(string[][])();
    foreach (line; File(args[1]).byLine())
        rows.put(line.idup.split("\t")); // byLine reuses its buffer, so idup is needed
    writeln(rows.data[1].join(","));
}

-------------------------

My Python 2.6 code:


from sys import argv
from collections import deque
import gc
import psyco

def main():
    gc.disable()
    rows = deque()
    for line in open(argv[1]):
        rows.append(line[:-1].split("\t"))
    print ",".join(rows[1])

psyco.full()
main()

-------------------------

The test data generator in Ruby:


for user_id in (1..10000)
  payments = (rand * 1000).to_i
  for user_payment_id in (1..payments)
    payment_id = user_id.to_s + user_payment_id.to_s
    payment_amount = "%.2f" % (rand * 30)
    is_card_present = "N"
    created_at = (rand * 10000000).to_i
    if payment_id.to_i % 3 == 0
      is_card_present = "Y"
    end
    puts [user_id, payment_id, payment_amount, is_card_present, created_at].join("\t")
  end
end

-------------------------

Bye,
bearophile
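P.S. For anyone who wants to check the parsing side without generating the full data file, here is a small self-contained Python sketch (the function name and its parameters are mine, not from the benchmark) that produces a few rows shaped like the Ruby generator's output and reads them back with the same split("\t") loop used in the timed programs:

```python
# A minimal sketch, not the benchmark itself: generate a handful of
# TSV rows with the same five columns as the Ruby script, then parse
# them line-by-line by splitting on tabs.
import random

def gen_rows(n_users=3, max_payments=4):
    rows = []
    for user_id in range(1, n_users + 1):
        for user_payment_id in range(1, random.randint(1, max_payments) + 1):
            payment_id = "%d%d" % (user_id, user_payment_id)
            payment_amount = "%.2f" % (random.random() * 30)
            is_card_present = "Y" if int(payment_id) % 3 == 0 else "N"
            created_at = int(random.random() * 10000000)
            rows.append("\t".join([str(user_id), payment_id, payment_amount,
                                   is_card_present, str(created_at)]))
    return rows

lines = gen_rows()
# the read side: one list of fields per line, as in the D and Python versions
parsed = [line.split("\t") for line in lines]
print(len(parsed[0]))  # prints 5: each row has five tab-separated fields
```

The real benchmark does the same split in a loop over ~2 million lines, so per-line allocation cost (the idup in D, the list object in Python) dominates the timing.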