Re: Idiomatic tokenizing and performance
Thanks, I should have done that myself, to be honest. My code is on GitHub: https://github.com/andrewchambers/ccc/blob/master/src/ccc/lex.clj . I don't think it compiles or runs on master at the moment, though.

If anyone is interested, I'm trying to test the feasibility, size, and maintainability of a Clojure (or ClojureScript) version of this guy's tiny C compiler: https://github.com/rui314/8cc. Basically, I want to compare functional programming with what I consider excellent C code.

Cheers,

On Monday, May 5, 2014 4:43:27 PM UTC+12, Atamert Ölçgen wrote:
> I created a gist of your code for better readability, I hope you don't mind.
>
> https://gist.github.com/muhuk/7c4a2b8db63886e2a9cd

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: Idiomatic tokenizing and performance
I created a gist of your code for better readability; I hope you don't mind.

https://gist.github.com/muhuk/7c4a2b8db63886e2a9cd

On Mon, May 5, 2014 at 12:36 PM, Andrew Chambers wrote:
> I've been trying to make a tokenizer/lexer for a project of mine and came
> up with the following code,
> [...]

--
Kind Regards,
Atamert Ölçgen

-+-
--+
+++

www.muhuk.com
Idiomatic tokenizing and performance
I've been trying to make a tokenizer/lexer for a project of mine and came up with the following code. I've modelled the stream of characters as a (lazy) seq of chars, which is then converted to a lazy seq of token objects.

I'm relatively happy with how idiomatic and functional the code seems. However, when benchmarked, the code takes about 30 seconds on Clojure (after I increase the heap to 1 GB) to process a 30 MB file, and over 1 minute 30 seconds with ClojureScript. This is in contrast to about 0.1 to 0.5 seconds or less in C. Is there any idiomatic way to process the file without being a factor of 100 slower than C?

Also, is there a tool for Clojure similar to gprof for C?

Each function takes in a char seq and returns both a token and the seq after it's been advanced.

    (defn match-ident
      [cs]
      (let [start (first cs)]
        (if (ident-first-char? start)
          (let [identseq (cons start (take-while ident-tail-char? (rest cs)))
                ^String ident (apply str identseq)]
            ;; returns [remaining-seq token]
            [(drop (.length ident) cs) [:ident ident]]))))

    (defn match-num
      [cs]
      (if (digit? (first cs))
        (let [numseq (take-while digit? cs)
              ^String numstr (apply str numseq)
              retseq (drop (.length numstr) cs)]
          ;; digits followed by a dot are not matched here
          (if (= (first retseq) \.)
            nil
            [retseq [:number numstr]]))))

    (defn match-ws
      [cs]
      (if (whitespace-char? (first cs))
        (let [wsseq (take-while whitespace-char? cs)
              ^String wsstr (apply str wsseq)
              retseq (drop (.length wsstr) cs)]
          [retseq [:ws wsstr]])))

    ...

    (defn next-token
      [cs]
      (or (match-ident cs)
          (match-ws cs)
          (match-punct cs)
          (match-num cs)
          (match-eof cs)
          (match-unknown cs)))

    ;; Here I build the lazy seq of tokens.

    (defn token-seq
      [cs]
      (let [[newcs tok] (next-token cs)]
        (lazy-seq (cons tok (token-seq newcs)))))

Cheers,
Andrew Chambers
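
One possible direction for the performance question above, offered as a sketch rather than a benchmarked answer. In the posted matchers, each token is walked several times (take-while, then apply str, then drop), and every step over a char seq allocates. An alternative is to keep the whole input as a String and pass an integer index around, so that a token costs one scan plus one substring copy, while the token stream itself stays a lazy seq. Everything below is an assumption made for illustration: the helper scan-while, the simplified predicates, and the :unknown fallback are hypothetical and are not the code in the ccc repository. match-punct, match-eof, and the "." check in match-num are left out, and the Character calls make this JVM Clojure only.

```clojure
;; Hypothetical stand-ins for the predicates used in the original post.
(defn ident-first-char? [c] (or (Character/isLetter (char c)) (= c \_)))
(defn ident-tail-char?  [c] (or (Character/isLetterOrDigit (char c)) (= c \_)))
(defn digit?            [c] (Character/isDigit (char c)))
(defn whitespace-char?  [c] (Character/isWhitespace (char c)))

(defn scan-while
  "Returns the index of the first char at or after `start` for which
  `pred` is false, or the length of `s` if there is none."
  [pred ^String s start]
  (let [n (.length s)]
    (loop [i (long start)]
      (if (and (< i n) (pred (.charAt s i)))
        (recur (inc i))
        i))))

(defn next-token
  "Returns [end-index token] for the token starting at index `i` of `s`.
  Only idents, integers, and whitespace are recognised in this sketch;
  any other char becomes a one-char :unknown token."
  [^String s i]
  (let [c (.charAt s i)]
    (cond
      (ident-first-char? c)
      (let [end (scan-while ident-tail-char? s (inc i))]
        [end [:ident (subs s i end)]])

      (digit? c)
      (let [end (scan-while digit? s i)]
        [end [:number (subs s i end)]])

      (whitespace-char? c)
      (let [end (scan-while whitespace-char? s i)]
        [end [:ws (subs s i end)]])

      :else
      [(inc i) [:unknown (str c)]])))

(defn token-seq
  "Lazy seq of tokens from string `s`, starting at index `i` (default 0).
  Ends when the index reaches the end of the string."
  ([s] (token-seq s 0))
  ([^String s i]
   (lazy-seq
     (when (< i (.length s))
       (let [[end tok] (next-token s i)]
         (cons tok (token-seq s end)))))))
```

For example, (token-seq "foo 42") yields ([:ident "foo"] [:ws " "] [:number "42"]). The trade-off is that the whole input has to be in memory as one String (for instance via slurp) instead of being streamed as a lazy char seq.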