Very cool. I read through your README (thank you for that), and did not notice an answer to this question: does seq on a utf8 string return a sequence of Unicode code points? Java UTF-16 code units (with surrogate pairs for characters outside the BMP)? Something else? It would be great to have the answer in the README.
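To make the question concrete, here is a small plain-Java sketch of the three counts that can differ for the same text. It does not use the utf8 library at all; the class name `CodePointDemo` and its helper methods are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: plain java.lang.String behaviour, not the utf8 library.
class CodePointDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so Java stores it
    // as a surrogate pair (two UTF-16 code units).
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "a" + CLEF;
        System.out.println(utf16Units(s)); // 3
        System.out.println(codePoints(s)); // 2
        System.out.println(utf8Bytes(s));  // 5
    }
}
```

A seq that yields code points would produce two elements for this string, while one that yields Java chars would produce three.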
One idea I had for such a thing was: if any operation ever traverses part of such a variable-bytes-per-char string for any reason (e.g. counting its length in Unicode code points, or indexing), maintain some kind of data structure mapping a few selected index values to their byte offsets within the byte array. For example, a string containing 100 Unicode code points might record the byte offset at which the UTF-8 encoding of every 32nd code point starts, or every 64th. This limits any sequential scan to start from the nearest preceding cached byte offset rather than from the beginning of the string. Not a trivial amount of implementation work, I know, but it would be cool.

On Thu, Nov 7, 2013 at 11:41 AM, Paul Stadig <p...@stadig.name> wrote:

> I have released a byte vector backed, utf8 string library for Clojure. It
> exists because when doing lots of ASCII string processing, Java strings can
> be pretty hefty. Using byte arrays to represent strings can be annoying for
> various reasons (mutability, bad equality semantics). However, byte vectors
> are persistent data structures that store byte values efficiently. Some
> interesting points:
>
> * these strings implement the CharSequence interface, so you can use most
> every clojure.string function with them, and you can match regular
> expressions against them, which is pretty cool.
> * in addition to the efficient storage, since they are persistent data
> structures you also get structure sharing
> * the library includes a "StringWriter" that will write directly to a utf8
> string, so you don't even have to go through Java Strings if you don't want
> to
> * utf8 strings can be seq'ed and are lazily decoded as you traverse them
> * also because they are persistent data structures you can use conj and
> into with them
>
> I think that's most of the salient features. The README has more details
> and examples.
>
>
> https://github.com/pjstadig/utf8
>
>
> Cheers,
> Paul
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
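P.S. The checkpoint idea from my reply could look roughly like the Java sketch below. Everything in it is hypothetical: the class `Utf8OffsetIndex`, its methods, and the `stride` parameter are names I made up, and none of it comes from the utf8 library. It assumes well-formed UTF-8 and, to stay short, builds the whole index with one full scan on first use instead of filling it in incrementally as parts of the string get traversed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not part of the utf8 library. Assumes well-formed UTF-8.
// Not thread-safe: the index is built lazily without synchronization.
final class Utf8OffsetIndex {
    private final byte[] bytes;
    private final int stride;   // code points between checkpoints, e.g. 32 or 64
    private int[] checkpoints;  // checkpoints[k] = byte offset of code point k * stride
    private int length = -1;    // length in code points; -1 until the first scan

    Utf8OffsetIndex(byte[] utf8, int stride) {
        if (stride < 1) throw new IllegalArgumentException("stride must be positive");
        this.bytes = utf8;
        this.stride = stride;
    }

    // Number of bytes in the UTF-8 sequence that starts with lead byte b.
    private static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;  // 0xxxxxxx
        if (v < 0xE0) return 2;  // 110xxxxx
        if (v < 0xF0) return 3;  // 1110xxxx
        return 4;                // 11110xxx
    }

    // One full scan that records a checkpoint every `stride` code points.
    private void buildIndex() {
        int[] offsets = new int[bytes.length / stride + 1];
        int count = 0;
        int used = 0;
        for (int pos = 0; pos < bytes.length; pos += sequenceLength(bytes[pos])) {
            if (count % stride == 0) offsets[used++] = pos;
            count++;
        }
        checkpoints = Arrays.copyOf(offsets, used);
        length = count;
    }

    // Length in Unicode code points; triggers the scan on first call.
    int length() {
        if (checkpoints == null) buildIndex();
        return length;
    }

    // Byte offset at which code point `index` starts.
    // Scans forward over at most stride - 1 code points from a checkpoint.
    int byteOffsetOf(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int pos = checkpoints[index / stride];
        for (int i = index % stride; i > 0; i--) {
            pos += sequenceLength(bytes[pos]);
        }
        return pos;
    }

    // Decodes the code point at `index`.
    int codePointAt(int index) {
        int pos = byteOffsetOf(index);
        int n = sequenceLength(bytes[pos]);
        return new String(bytes, pos, n, StandardCharsets.UTF_8).codePointAt(0);
    }
}
```

With a stride of 32, the index costs one int per 32 code points, and any indexed lookup decodes at most 31 lead bytes after jumping to the checkpoint. A larger stride trades memory for longer scans.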