Re: Idiomatic tokenizing and performance

2014-05-04 Thread Andrew Chambers

Thanks, I should have done that, to be honest. My code is on GitHub: 
https://github.com/andrewchambers/ccc/blob/master/src/ccc/lex.clj . I don't 
think it compiles or runs on master currently, though.
If anyone is interested, I'm trying to test the 
feasibility/size/maintainability of a Clojure (or ClojureScript) version of 
this guy's tiny C compiler, https://github.com/rui314/8cc. Basically, the goal 
is to compare functional programming with what I consider excellent C code.

Cheers,

On Monday, May 5, 2014 4:43:27 PM UTC+12, Atamert Ölçgen wrote:
>
> I created a gist of your code for better readability, I hope you don't 
> mind.
>
> https://gist.github.com/muhuk/7c4a2b8db63886e2a9cd

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Idiomatic tokenizing and performance

2014-05-04 Thread Atamert Ölçgen

I created a gist of your code for better readability; I hope you don't mind.

https://gist.github.com/muhuk/7c4a2b8db63886e2a9cd





-- 
Kind Regards,
Atamert Ölçgen

-+-
--+
+++

www.muhuk.com


Idiomatic tokenizing and performance

2014-05-04 Thread Andrew Chambers

I've been trying to make a tokenizer/lexer for a project of mine and came 
up with the following code.
I've modelled the stream of characters as a (lazy) seq of chars, which is 
then converted to a lazy seq of token objects.
I'm relatively happy with how idiomatic and functional the code seems; 
however, when benchmarked, the code takes about 30 seconds on Clojure (after 
I increase the heap to 1 GB) to process a 30 MB file, and over 1 minute 
30 seconds with ClojureScript. This is in contrast to about 0.1 to 0.5 
seconds or less in C. Is there any idiomatic way to process the file without 
being a factor of 100 slower than C?

Also, is there a tool for Clojure similar to gprof for C?


Each function takes in a char seq and returns both a token and the seq 
after it has been advanced.

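;; Returns [remaining-seq [:ident name]], or nil if cs does not start
;; with an identifier.
(defn match-ident
  [cs]
  (let [start (first cs)]
    (if (ident-first-char? start)
      (let [identseq (cons start (take-while ident-tail-char? (rest cs)))
            ^String ident (apply str identseq)]
        [(drop (.length ident) cs) [:ident ident]]))))

;; Returns [remaining-seq [:number digits]], or nil if cs does not start
;; with a digit or the digits are followed by a \. character.
(defn match-num
  [cs]
  (if (digit? (first cs))
    (let [numseq (take-while digit? cs)
          ^String numstr (apply str numseq)
          retseq (drop (.length numstr) cs)]
      (if (= (first retseq) \.)
        nil
        [retseq [:number numstr]]))))

;; Returns [remaining-seq [:ws whitespace]], or nil if cs does not start
;; with whitespace.
(defn match-ws
  [cs]
  (if (whitespace-char? (first cs))
    (let [wsseq (take-while whitespace-char? cs)
          ^String wsstr (apply str wsseq)
          retseq (drop (.length wsstr) cs)]
      [retseq [:ws wsstr]])))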


...

(defn next-token
  [cs]
  (or (match-ident cs)
      (match-ws cs)
      (match-punct cs)
      (match-num cs)
      (match-eof cs)
      (match-unknown cs)))

;; Here I build the lazy seq of tokens.

(defn token-seq
  [cs]
  (let [[newcs tok] (next-token cs)]
    (lazy-seq (cons tok (token-seq newcs)))))
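
For comparison, here is a minimal sketch of one commonly suggested alternative: 
scanning a String by index instead of walking a seq of chars, which may avoid 
allocating a seq cell per character. It is not the code from the post and it 
has not been benchmarked. The predicate definitions, `scan-while`, and 
`tokenize` are hypothetical names introduced for illustration. The sketch is 
eager rather than lazy, is JVM-only (it relies on `java.lang.Character` and 
`.charAt`), treats every other character as a one-character `:unknown` token, 
and leaves out the `\.` check from `match-num` as well as the punctuation and 
EOF matchers that are elided above.

```clojure
;; Hypothetical predicates; the original post does not show its own versions.
(defn ident-first-char? [c] (or (Character/isLetter (char c)) (= c \_)))
(defn ident-tail-char?  [c] (or (ident-first-char? c) (Character/isDigit (char c))))
(defn digit?            [c] (Character/isDigit (char c)))
(defn whitespace-char?  [c] (Character/isWhitespace (char c)))

(defn scan-while
  "Returns the first index >= start at which pred fails, or the length of s."
  [pred ^String s start]
  (let [n (.length s)]
    (loop [i (long start)]
      (if (and (< i n) (pred (.charAt s i)))
        (recur (inc i))
        i))))

(defn tokenize
  "Eagerly tokenizes s into a vector of [kind text] pairs."
  [^String s]
  (let [n (.length s)]
    (loop [i 0
           toks (transient [])]
      (if (>= i n)
        (persistent! toks)
        (let [c (.charAt s i)
              ;; Pick the token kind from the first char, then the predicate
              ;; that decides how far the token extends.
              [kind pred] (cond
                            (ident-first-char? c) [:ident ident-tail-char?]
                            (whitespace-char? c)  [:ws whitespace-char?]
                            (digit? c)            [:number digit?]
                            :else                 [:unknown (constantly false)])
              end (long (scan-while pred s (inc i)))]
          (recur end (conj! toks [kind (subs s i end)])))))))

(tokenize "foo 42")
;; => [[:ident "foo"] [:ws " "] [:number "42"]]
```

Each token is cut out with a single `subs` call once its end index is known, 
so no intermediate seq of characters is built per token. Whether this closes 
the gap to C on a 30 MB input would need to be measured.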


Cheers,
Andrew Chambers


