Ok, thanks. The more I think about it, the less beneficial it looks to me.
JensG

----- Original Message -----
From: Randy Abernethy
Sent: 01.01.2016 18:18
To: dev@thrift.apache.org
Subject: Re: UTF-16

Right now the "string" type in Apache Thrift is abstract. I like that. If you are using Java or C# then Apache Thrift strings are UTF-16. If you are using Python or Go, then Apache Thrift strings are UTF-8. So the protocols are already serializing between the language native string type and UTF-8 on the wire.

Adding UTF-16 as an additional wire protocol would have pros and cons:

CONS (all counter to the efficiency goal in addition to adding complexity):
- Equal or larger payload, with the exception of strings heavy on East Asian characters
- Byte order becomes a factor
  * If we define a byte order some platforms might have to swap every pair of bytes (though this is very cheap it is not free, particularly for large strings)
  * If we don't pick a byte order we need to negotiate the byte order (BOM probably? which would noticeably increase the size of small strings and require byte swapping on opposing platforms)

PROS:
- For like endian and encoding systems, strings could be serialized directly.
  * Important to note that Compact protocol does not compress strings. Losing the typical compression associated with going from UTF-16 -> UTF-8 might make payloads larger and slower.

Compact protocol has always traded CPU for size, which it does today when converting UTF-16 to UTF-8 on the wire (though as noted there are cases where the UTF-8 string is longer).

If folks think this is a path worth exploring it would be nice to see some empirical test cases as a next step. Are the trades worth making?

On Fri, Jan 1, 2016 at 7:46 AM, Ben Craig <ben.cr...@gmail.com> wrote:

> I don't like the idea of adding a new utf-16 string type to the wire protocol, but I think it would be fine to add a utf-16 string type to the language bindings.
> UTF-8 would be sent over the wire, and then converted from the network buffer into the user's desired string type. A lot of the cost and inconvenience of utf-8 and utf-16 is just dealing with all the conversions, and Thrift seems like a reasonable place to remove one of those conversions.
>
> On Fri, Jan 1, 2016 at 4:01 AM, Jens Geyer <jensge...@hotmail.com> wrote:
>
> > Yes, that was the question. It could eliminate some conversions from and to utf8 (speed is a Thrift goal) but I'm not sure if the possible gains are worth doing it.
> >
> > Re keeping it simple: I fully agree, absolutely. But we have 4 integer types and there are thoughts to integrate floats as well ...
> >
> > Happy new year!
> > ________________________________
> > From: Randy Abernethy
> > Sent: 01.01.2016 02:56
> > To: dev@thrift.apache.org
> > Subject: Re: UTF-16
> >
> > Hey David,
> >
> > Apache Thrift has a "string" type in its IDL and that type is a language native string in the generated code but is UTF-8 on the wire when using binary, compact or JSON protocols by default.
> >
> > I think Jens is posing the question (correct me if I'm wrong Jens): Should we also support UTF-16 string encoding on the wire with binary, compact and JSON protocols?
> >
> > -Randy
> >
> > On Thu, Dec 31, 2015 at 5:09 PM, David Bennett <da...@yorkage.com> wrote:
> >
> > > >>>while UTF-8 is great, especially on Windows platforms UTF-16 is more common, because the OS uses it heavily internally. Since Win2k it also supports surrogates and supplementary characters. So there’s OS support for it. What I don’t know is, how universally is UTF-16 (or a subset of it) supported across other platforms? Can we assume a certain degree of support on all the various platforms that Thrift can run on?
> > >
> > > >>>TL;DR: Would it make sense to add UTF-16 as another string format type?
> > >
> > > In my opinion, no.
> > > This is based on a mistaken understanding or expectation.
> > >
> > > Thrift currently supports a string of bytes as a type, and users who wish to exchange character string data are expected to impose some kind of meaning on top of that.
> > >
> > > What Thrift needs is a genuine string data type, independent of any particular transport format, and which fully supports Unicode code points. The transport mechanism could be UTF-8, UTF-16, UTF-32 or variable length (zigzag) integers (currently Unicode requires about 21 bits).
> > >
> > > User libraries would of course be free to reformat those Unicode strings into any format comfortably supported by the platform. On Windows UTF-16 is preferred, but should never be viewed as something different from the underlying Unicode string.
> > >
> > > Regards
> > > David M Bennett FACS
> > >
> > > Andl - A New Database Language - andl.org
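
For anyone who wants a quick feel for the payload-size and BOM points raised in this thread before running real benchmarks, here is a minimal Python sketch. It only measures encoded byte lengths with the standard library; it does not touch Thrift itself. The helper name `wire_sizes` and the sample strings are hypothetical, chosen purely for illustration.

```python
# Rough size comparison for the encodings discussed in the thread.
# Assumption: the sample strings below are illustrative, not Thrift test data.

def wire_sizes(text: str) -> dict:
    """Return the encoded byte length of `text` under each candidate encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # Fixed byte order, so no BOM is written.
        "utf-16-le": len(text.encode("utf-16-le")),
        # Python's "utf-16" codec prepends a 2-byte BOM in native byte order.
        "utf-16+bom": len(text.encode("utf-16")),
    }

samples = {
    "ascii": "hello",
    "german": "Größe",
    "japanese": "こんにちは",
    "emoji": "😀",
}

for name, text in samples.items():
    print(name, wire_sizes(text))
```

On these samples the ASCII string doubles in size under UTF-16, the Japanese string shrinks from 15 to 10 bytes, and the BOM adds a flat 2 bytes per string, which matters most for short strings. This says nothing about conversion CPU cost, which would need the empirical tests Randy asks for.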