Yes, first-words-regex2 is pretty much identical in performance to my
longer regex-less version. Thanks for the pointer! I was not familiar with
the use of regexp-match functions on an input port. It’s a little wild to
me how even using a string port, a general-purpose pattern matching
function can be just about as fast as a custom-built loop that knows
exactly what it wants. But the regex library has probably been pretty well
optimized by now.
raco test: (submod "test-func.rkt" test)
cpu time: 140 real time: 140 gc time: 9 ; first-words
cpu time: 117 real time: 117 gc time: 0 ; first-words-regex2
"She counted one two silently"
"She counted one two silently"
cpu time: 124 real time: 124 gc time: 0
cpu time: 122 real time: 122 gc time: 0
"Stop! she called. She was"
"Stop she called. She was"
cpu time: 345 real time: 346 gc time: 0
cpu time: 358 real time: 358 gc time: 2
"In a 2005 episode of"
"In a 2005 episode of"
On Thursday, March 21, 2019 at 12:39:24 AM UTC-5, Matthew Butterick wrote:
>
>
>
> On Mar 20, 2019, at 8:19 PM, Joel Dueck >
> wrote:
>
> The regular-expression version is slower, much more so, it seems, the
> larger the first txexpr that you give it.
>
> I am sure both functions could be made much faster. In particular the
> regex version matches all the words in each string, there is probably a
> better pattern that would stop after the first N words.
>
>
> Your regexp-based function is slower because of `regexp-match*`, which
> eagerly finds all the matches (whether you need them or not). Whereas the
> port-based function is faster because it works incrementally.
>
> But you can do both at the same time, by passing an input port as the
> argument to `regexp-match`. In this example, the pattern is matched
> incrementally, and if we don't get enough words, we incrementally process
> the next txexpr.
>
> (require racket/string)
> (define (first-words-regex2 txs n)
> (define words
> (let loop ([txs txs][n n])
> (define ip (open-input-string (tx-strs (car txs
> (define words (for*/list ([i (in-range n)]
> [bs (in-value (regexp-match #px"\\w+" ip))]
> #:break (not bs))
> (bytes->string/utf-8 (car bs
> (if (= (length words) n)
> words
> (append words (loop (cdr txs) (- n (length words)))
> (string-join words " "))
>
>
--
You received this message because you are subscribed to the Google Groups
"Pollen" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to pollenpub+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.