Re: Idiomatic tokenizing and performance

Atamert Ölçgen Sun, 04 May 2014 21:44:26 -0700

I created a gist of your code for better readability; I hope you don't mind.


https://gist.github.com/muhuk/7c4a2b8db63886e2a9cd


On Mon, May 5, 2014 at 12:36 PM, Andrew Chambers
<andrewchambe...@gmail.com> wrote:

> I've been trying to make a tokenizer/lexer for a project of mine and came
> up with the following code.
> I've modelled the stream of characters as a (lazy) seq of chars, which is
> then converted to a lazy seq of token objects.
> I'm relatively happy with how idiomatic and functional the code seems.
> However, when benchmarked, the code takes about 30 seconds on Clojure (after
> I increase the heap to 1 GB)
> to process a 30 MB file, and over 1 minute 30 seconds with ClojureScript.
> This is in contrast to about 0.1 to 0.5 seconds or less in C. Is
> there any idiomatic way to process the file without being 100
> times slower than C?
>
> Also, is there a tool for Clojure similar to gprof for C?
>
>
> Each function takes a char seq and returns both a token and the seq
> after it has been advanced.
>
> (defn match-ident
>   [cs]
>   (let [start (first cs)]
>     (if (ident-first-char? start)
>       (let [identseq (cons start (take-while ident-tail-char? (rest cs)))
>             ^String ident (apply str identseq)]
>         [(drop (.length ident) cs) [:ident ident]]))))
>
> (defn match-num
>   [cs]
>   (if (digit? (first cs))
>     (let [numseq (take-while digit? cs)
>           ^String numstr (apply str numseq)
>           retseq (drop (.length numstr) cs)]
>       (if (= (first retseq) \.)
>         nil
>         [retseq [:number numstr]]))))
>
> (defn match-ws
>   [cs]
>   (if (whitespace-char? (first cs))
>     (let [wsseq (take-while whitespace-char? cs)
>           ^String wsstr (apply str wsseq)
>           retseq (drop (.length wsstr) cs)]
>       [retseq [:ws wsstr]])))
>
>
> ...
>
> (defn next-token
>   [cs]
>   (or (match-ident cs)
>       (match-ws cs)
>       (match-punct cs)
>       (match-num cs)
>       (match-eof cs)
>       (match-unknown cs)))
>
> ;; Here I build the lazy seq of tokens.
>
> (defn token-seq
>   [cs]
>   (let [[newcs tok] (next-token cs)]
>     (lazy-seq (cons tok (token-seq newcs)))))
>
>
> Cheers,
> Andrew Chambers
>
>
>  --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
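
One direction that might help, offered as a rough sketch rather than a
measured fix: the quoted matchers walk each token about three times
(take-while, apply str, drop) over a boxed seq of chars. Scanning a String by
index and cutting tokens out with subs avoids most of that allocation. The
character predicates below are assumed stand-ins, since the originals were not
shown in the post; scan-while is a hypothetical helper; punctuation, EOF
tokens and the "number followed by a dot" rule are left out for brevity.

```clojure
;; Assumed stand-ins for the predicates used in the quoted code.
(defn ident-first-char? [c] (or (Character/isLetter (char c)) (= c \_)))
(defn ident-tail-char?  [c] (or (Character/isLetterOrDigit (char c)) (= c \_)))
(defn digit?            [c] (Character/isDigit (char c)))
(defn whitespace-char?  [c] (Character/isWhitespace (char c)))

(defn scan-while
  "Hypothetical helper: index of the first char at or after `start`
  that fails `pred`, or the length of `s` if none fails."
  [pred ^String s start]
  (let [n (.length s)]
    (loop [i (long start)]
      (if (and (< i n) (pred (.charAt s i)))
        (recur (inc i))
        i))))

(defn next-token
  "Returns [next-index token], or nil once `i` reaches the end of `s`."
  [^String s i]
  (when (< i (.length s))
    (let [c (.charAt s (int i))
          ;; Pick the token kind from the first char, then the predicate
          ;; that decides how far the token extends.
          [kind pred] (cond
                        (ident-first-char? c) [:ident ident-tail-char?]
                        (digit? c)            [:number digit?]
                        (whitespace-char? c)  [:ws whitespace-char?]
                        :else                 [:unknown (constantly false)])
          end (scan-while pred s (inc i))]
      [end [kind (subs s i end)]])))

(defn token-seq
  "Lazy seq of tokens over the string `s`, ending at end of input."
  ([s] (token-seq s 0))
  ([s i]
   (lazy-seq
     (when-let [[end tok] (next-token s i)]
       (cons tok (token-seq s end))))))
```

For example, (token-seq "foo 42") should yield [:ident "foo"], [:ws " "] and
[:number "42"]. This assumes the whole input fits in memory as a String, for
instance via slurp. On the gprof question, general JVM profilers such as
VisualVM attach to a running Clojure process and can show where the time goes.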



-- 
Kind Regards,
Atamert Ölçgen

-+-
--+
+++

www.muhuk.com

