I think the ruby version is smarter than your Racket version. Specifically, if you remove the .size (but don't print the output), things slow down by a factor of about 2 for me.
$ time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f)' > /dev/null real 0m1.127s user 0m1.071s sys 0m0.054s $ time ruby -e 'p File.read("X_train.txt").split(/\s+/).map(&:to_f).size' 655360 real 0m0.561s user 0m0.512s sys 0m0.047s But meanwhile, it looks like the regexp matching is slower than in Ruby. Below are some variations on the code that explore some speedups for the computation you intended to happen, I believe. The middle version (fetch-whitespace-delimited-string2) takes the same amount of time as the version of ruby that is forced to actually compute the list. So maybe there are still more improvements to be done to regexp port matching, at least for certain regexps, I'm not sure. Robby #lang racket (define (main fetch-whitespace-delimited-string) (length (call-with-input-file "X_train.txt" (λ (port) (let loop () (define m (fetch-whitespace-delimited-string port)) (cond [m (cons (string->number m) (loop))] [else '()])))))) ;; this one uses a regexp on the port (define (fetch-whitespace-delimited-string1 p) (define m (regexp-match #rx"([^ \n]*)[ \n]" p)) (cond [m (bytes->string/utf-8 (list-ref m 1))] [else #f])) ;; this one has a specialized loop that ;; expects exactly one whitespace char (define (fetch-whitespace-delimited-string2 p) (let loop ([cs '()]) (define c (read-char p)) (cond [(or (eof-object? c) (memq c '(#\space #\newline))) (cond [(null? cs) #f] [else (apply string (reverse cs))])] [else (loop (cons c cs))]))) ;; this one doesn't do any IO, just returns the same number (begin (define c 655360) ;; put the number of numbers in the file here (define (fetch-whitespace-delimited-string3 p) (begin0 (cond [(= c 0) #f] [else "2.8858451e-001"]) (set! c (- c 1))))) (collect-garbage) (collect-garbage) (collect-garbage) (time (main fetch-whitespace-delimited-string1)) (collect-garbage) (collect-garbage) (collect-garbage) (time (main fetch-whitespace-delimited-string2)) (collect-garbage) (collect-garbage) (collect-garbage) (time (main fetch-whitespace-delimited-string3)) On Wed, Jun 18, 2014 at 1:51 AM, Ryan Davis <zenspi...@gmail.com> wrote: > > On Jun 17, 2014, at 23:26, Robby Findler <ro...@eecs.northwestern.edu> wrote: > >> Could it be the float parser? Do you actually get the same floats in >> both versions? > > Yup. I'm getting floats for sure. > > The majority of the time in the original version was definitely being spent > on string-split (and I confirmed that regexp-split is basically the same, > even with bytestrings). > > I found a post about regexp-split being dreadfully slow for a big string and > a tiny pattern that Matt Flatt fixed a long while ago. This might be a > similar edge case? big string, tiny pattern, but tons of matches? ____________________ Racket Users list: http://lists.racket-lang.org/users