Ok, thanks. The more I think about it, the less beneficial it looks to me.

JensG


----- Original Message -----
From: Randy Abernethy
Sent: 01.01.2016 18:18
To: dev@thrift.apache.org
Subject: Re: UTF-16

Right now the "string" type in Apache Thrift is abstract. I like that. If
you are using Java or C# then Apache Thrift strings are UTF-16. If you are
using Python or Go, then Apache Thrift strings are UTF-8. So the protocols
are already serializing between the language native string type and UTF-8
on the wire.
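
To make that concrete, here is a minimal sketch in Python. It is not Thrift's
generated code, and write_string/read_string are hypothetical helper names. It
assumes the binary protocol's usual framing of a string as a 4-byte big-endian
length followed by the UTF-8 bytes:

```python
import struct


def write_string(s: str) -> bytes:
    """Hypothetical helper: native string -> length-prefixed UTF-8 bytes."""
    payload = s.encode("utf-8")
    # Assumed framing: signed 32-bit big-endian byte count, then the bytes.
    return struct.pack("!i", len(payload)) + payload


def read_string(buf: bytes) -> str:
    """Hypothetical helper: length-prefixed UTF-8 bytes -> native string."""
    (n,) = struct.unpack_from("!i", buf, 0)
    return buf[4:4 + n].decode("utf-8")
```

A Java or C# binding would do the same thing, except that the native side of
the conversion is UTF-16 rather than Python's internal representation.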



Adding UTF-16 as an additional wire protocol would have pros and cons:



CONS (all run counter to the efficiency goal, in addition to adding complexity):

 - Equal or larger payload, except for strings heavy on East Asian
characters

 - Byte order becomes a factor

          * If we define a byte order some platforms might have to swap
every pair of bytes (though this is very cheap it is not free, particularly
for large strings)

          * If we don't pick a byte order we need to negotiate the byte
order (BOM probably? which would noticeably increase the size of small
strings and require byte swapping on opposing platforms)
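
A quick way to see the size effects described above is the following Python
sketch. The sizes() helper is hypothetical and only compares encoded lengths;
it says nothing about CPU cost:

```python
def sizes(s: str) -> dict:
    """Hypothetical helper: encoded byte counts for one string."""
    return {
        "utf-8": len(s.encode("utf-8")),
        # Fixed byte order agreed up front, so no BOM is sent.
        "utf-16 fixed order": len(s.encode("utf-16-le")),
        # Python's plain "utf-16" codec prepends a 2-byte BOM.
        "utf-16 with BOM": len(s.encode("utf-16")),
    }


for sample in ("id", "hello", "日本語"):
    print(repr(sample), sizes(sample))
```

For "id" the BOM variant is 6 bytes against 2 bytes of UTF-8, which is the
small-string overhead mentioned above, while "日本語" takes 9 bytes in UTF-8
and 6 in UTF-16 without a BOM.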



PROS:

 - Between systems that share the same byte order and native encoding,
strings could be serialized directly.

          * Important to note that Compact protocol does not compress
strings. Losing the typical compression associated with going from UTF-16
-> UTF-8 might make payloads larger and slower. Compact protocol has always
traded CPU for size, which it does today when converting UTF-16 to UTF-8 on
the wire (though as noted there are cases where the UTF-8 string is longer).





If folks think this is a path worth exploring, it would be nice to see some
empirical test cases as a next step. Are the trade-offs worth making?





On Fri, Jan 1, 2016 at 7:46 AM, Ben Craig <ben.cr...@gmail.com> wrote:

> I don't like the idea of adding a new utf-16 string type to the wire
> protocol, but I think it would be fine to add a utf-16 string type to the
> language bindings.  UTF-8 would be sent over the wire, and then converted
> from the network buffer into the user's desired string type.  A lot of the
> cost and inconvenience of utf-8 and utf-16 is just dealing with all the
> conversions, and Thrift seems like a reasonable place to remove one of
> those conversions.
>
> On Fri, Jan 1, 2016 at 4:01 AM, Jens Geyer <jensge...@hotmail.com> wrote:
>
> > Yes, that was the question. It could eliminate some conversions from and
> > to UTF-8 (speed is a Thrift goal), but I'm not sure the possible gains
> > are worth it.
> >
> > Re keeping it simple: I fully agree, absolutely. But we have 4 integer
> > types and there are thoughts to integrate floats as well ...
> >
> > Happy new year!
> > ________________________________
> > From: Randy Abernethy
> > Sent: 01.01.2016 02:56
> > To: dev@thrift.apache.org
> > Subject: Re: UTF-16
> >
> > Hey David,
> >
> > Apache Thrift has a "string" type in its IDL and that type is a language
> > native string in the generated code but is UTF-8 on the wire when using
> > binary, compact or JSON protocols by default.
> >
> > I think Jens is posing the question (correct me if I'm wrong, Jens):
> > Should we also support UTF-16 string encoding on the wire with the
> > binary, compact and JSON protocols?
> >
> > -Randy
> >
> > On Thu, Dec 31, 2015 at 5:09 PM, David Bennett <da...@yorkage.com> wrote:
> >
> > > >>>while UTF-8 is great, especially on Windows platforms UTF-16 is more
> > > common, because the OS uses it heavily internally. Since Win2k it also
> > > supports surrogates and supplementary characters. So there’s OS support
> > > for it. What I don’t know is, how universally is UTF-16 (or a subset of
> > > it) supported across other platforms? Can we assume a certain degree of
> > > support on all the various platforms that Thrift can run on?
> > >
> > > >>>TL;DR: Would it make sense to add UTF-16 as another string format
> > > type?
> > >
> > > In my opinion, no. This is based on a mistaken understanding or
> > > expectation.
> > >
> > > Thrift currently supports a string of bytes as a type, and users who
> > > wish to exchange character string data are expected to impose some kind
> > > of meaning on top of that.
> > >
> > > What Thrift needs is a genuine string data type, independent of any
> > > particular transport format, and which fully supports Unicode code
> > > points. The transport mechanism could be UTF-8, UTF-16, UTF-32 or
> > > variable length (zigzag) integers (currently Unicode requires about
> > > 21 bits).
> > >
> > > User libraries would of course be free to reformat those Unicode
> > > strings into any format comfortably supported by the platform. On
> > > Windows UTF-16 is preferred, but should never be viewed as something
> > > different from the underlying Unicode string.
> > >
> > > Regards
> > > David M Bennett FACS
> > >
> > > Andl - A New Database Language - andl.org
> > >
> > >
> > >
> > >
> > >
> > >
> >
>
