Very cool.

I read through your README (thank you for that), and did not find an
answer to this question: does seq on a utf8 string return a sequence of
Unicode code points?  Java UTF-16 code units (with a surrogate pair for
each character outside the BMP)?  Something else?  It would be great to
have the answer in the README.
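
To show why the distinction matters, here is a small Java sketch using
only standard String methods (the class and method names are made up for
illustration and have nothing to do with the utf8 library).  It reports
the three different "lengths" a string can have:

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: three ways to measure the same string. */
final class CodePointVsCodeUnit {
    /** Returns {UTF-16 code units, Unicode code points, UTF-8 bytes} for s. */
    static int[] sizes(String s) {
        return new int[] {
            s.length(),                               // UTF-16 code units
            s.codePointCount(0, s.length()),          // Unicode code points
            s.getBytes(StandardCharsets.UTF_8).length // UTF-8 bytes
        };
    }
}
```

For "a" followed by U+1F600 (which is outside the BMP), the three numbers
are 3, 2, and 5, so what seq yields changes how many elements you see.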

One idea I had for such a thing: whenever an operation traverses part of
such a variable-bytes-per-char string for any reason (e.g. counting its
length in Unicode code points, or indexing), maintain some kind of data
structure mapping a few selected code point indices to their byte offsets
within the byte array.  For example, a string containing 100 Unicode code
points might record the byte offset where the UTF-8 encoding of every
32nd code point starts, or every 64th.  Any sequential scan then only has
to start from the nearest preceding cached byte offset rather than from
the beginning of the string.  Not a trivial amount of implementation
work, I know, but it would be cool.
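
To make the idea concrete, here is a rough Java sketch.  It is not
anything from the utf8 library: the class IndexedUtf8, its methods, and
the stride of 32 are all hypothetical, and it assumes well-formed UTF-8
in a plain byte array rather than a byte vector.  Checkpoints are
recorded lazily, only as far as some lookup has actually scanned:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: UTF-8 bytes plus a lazily built, sparse index of byte offsets. */
final class IndexedUtf8 {
    private static final int STRIDE = 32; // assumed spacing between cached offsets

    private final byte[] bytes;
    // checkpoints.get(k) is the byte offset at which code point k * STRIDE starts.
    private final List<Integer> checkpoints = new ArrayList<>();
    private int frontierCodePoints = 0; // code points traversed so far
    private int frontierBytes = 0;      // byte offset just past them

    IndexedUtf8(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    /** Byte length of the UTF-8 sequence with lead byte b (assumes well-formed input). */
    private static int sequenceLength(byte b) {
        int u = b & 0xFF;
        if (u < 0x80) return 1;
        if (u < 0xE0) return 2;
        if (u < 0xF0) return 3;
        return 4;
    }

    /** Number of byte offsets cached so far. */
    int checkpointCount() {
        return checkpoints.size();
    }

    /** Byte offset at which the code point with the given index starts. */
    int byteOffsetOf(int index) {
        if (index < 0) throw new IndexOutOfBoundsException("index " + index);
        int k = index / STRIDE;
        // Extend the index only as far as needed, caching each checkpoint crossed.
        while (checkpoints.size() <= k) {
            if (frontierBytes >= bytes.length) throw new IndexOutOfBoundsException("index " + index);
            if (frontierCodePoints % STRIDE == 0) checkpoints.add(frontierBytes);
            frontierBytes += sequenceLength(bytes[frontierBytes]);
            frontierCodePoints++;
        }
        // Scan forward from the nearest preceding checkpoint: fewer than STRIDE steps.
        int offset = checkpoints.get(k);
        for (int cp = k * STRIDE; cp < index && offset < bytes.length; cp++) {
            offset += sequenceLength(bytes[offset]);
        }
        if (offset >= bytes.length) throw new IndexOutOfBoundsException("index " + index);
        return offset;
    }

    /** Decodes the code point with the given index. */
    int codePointAt(int index) {
        int offset = byteOffsetOf(index);
        int len = sequenceLength(bytes[offset]);
        return new String(bytes, offset, len, StandardCharsets.UTF_8).codePointAt(0);
    }
}
```

The first lookup of a far-away index pays for one scan up to its
checkpoint; later lookups anywhere in the already-scanned region walk at
most 31 code points.  A persistent version would need to think about how
the cache is shared and updated, which this sketch ignores.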



On Thu, Nov 7, 2013 at 11:41 AM, Paul Stadig <p...@stadig.name> wrote:

> I have released a byte vector backed, utf8 string library for Clojure. It
> exists because when doing lots of ASCII string processing, Java strings can
> be pretty hefty. Using byte arrays to represent strings can be annoying for
> various reasons (mutability, bad equality semantics). However, byte vectors
> are persistent data structures that store byte values efficiently. Some
> interesting points:
>
> * these strings implement the CharSequence interface, so you can use most
> every clojure.string function with them, and you can match regular
> expressions against them, which is pretty cool.
> * in addition to the efficient storage, since they are persistent data
> structures you also get structure sharing
> * the library includes a "StringWriter" that will write directly to a utf8
> string, so you don't even have to go through Java Strings if you don't want
> to
> * utf8 strings can be seq'ed and are lazily decoded as you traverse them
> * also because they are persistent data structures you can use conj and
> into with them
>
> I think that's most of the salient features. The README has more details
> and examples.
>
>
> https://github.com/pjstadig/utf8
>
>
> Cheers,
> Paul
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
