On Thu, 29 Sep 2022 15:19:59 -0400
Larry White <ljw1...@gmail.com> wrote:
> Interesting. This doesn't seem to be a Java issue, per se then. I've seen
> admonitions in various Arrow Java threads to always specify the Charset for
> the conversion - and so assumed more than one Charset was legal - and have
> written Arrow Java test code that uses other charsets without ill effect.
>
> I've never attempted to transport that data over the wire or export it
> using the C-Data Interface, however. It seems like that's where it would
> fall down.

For performance reasons, most consumers of Arrow data do not necessarily
check that it's valid utf-8. They would, however, definitely misinterpret
data stored in any other encoding.

The "string" (also called "utf8" in some implementations) data type is
definitely specified as being valid utf-8.

Given the dwindling popularity of utf-16 and the growing universality of
utf-8, I don't think it would be a good idea to add another datatype for
utf-16. However, an extension type would be doable.

I think a step back is needed first: what is the use case for
transporting utf-16 data in Arrow?

Regards

Antoine.


>
> On Thu, Sep 29, 2022 at 3:01 PM James Henderson <j...@juxt.pro> wrote:
>
> > FWIW we'd made a similar assumption. In Schema.fbs [1] the type is called
> > Utf8, as well as the Java `ArrowType.Utf8` class - is this a required
> > assumption to work with other language Arrow libs, maybe?
> >
> > James
> >
> > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs
> >
> > On Thu, 29 Sept 2022 at 18:57, Larry White <ljw1...@gmail.com> wrote:
> >
> > > Hi Kevin,
> > >
> > > I don't know of any particular restriction regarding string encoding.
> > > VarCharVector stores data as a byte array, and the encoding can be set
> > > using the Charset class when you convert Strings to and from bytes. Since
> > > java strings use UTF-16 internally, I would expect this to 'just work'.
> > >
> > > larry
> > >
> > > On Thu, Sep 29, 2022 at 12:46 PM Kevin Bambrick <kevinbambri...@gmail.com>
> > > wrote:
> > >
> > > > Hi.
> > > >
> > > > Was just wondering was support for UTF-16 Strings considered? As far as I
> > > > am aware VarChar vectors only support UTF-8. Are they something that may be
> > > > supported in the future?
> > > >
> > > > Regards.
> > > > Kevin.
> > > >
> >
> > --
> > *James Henderson*
> > XTDB Development Manager at *JUXT*
> >
> > Email j...@juxt.pro
> > Website https://juxt.pro
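
To make the point about misinterpretation concrete, here is a minimal sketch using only the JDK (no Arrow dependency). It assumes a consumer that simply treats the bytes of a string value as utf-8, which is what the "string" type promises. The class and method names (`Utf8Check`, `isValidUtf8`, `readAsUtf8`) are made up for illustration and are not part of any Arrow API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Hypothetical helper: strict validation that reports malformed input
    // instead of silently substituting replacement characters.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: what a consumer does when it trusts the type
    // and decodes the raw bytes as utf-8 without checking them.
    public static String readAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String accented = "h\u00e9llo"; // "héllo", escaped to keep the source ASCII-only

        // Non-ASCII text encoded as utf-16 is not valid utf-8.
        byte[] accentedUtf16 = accented.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(isValidUtf8(accentedUtf16)); // false

        // Pure ASCII text encoded as utf-16 happens to be valid utf-8,
        // so even a validating consumer would accept it and misread it.
        byte[] asciiUtf16 = ascii.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(isValidUtf8(asciiUtf16));         // true
        System.out.println(readAsUtf8(asciiUtf16).length()); // 10: NUL after every letter
    }
}
```

The second case is the more worrying one: the bytes pass validation, so nothing fails loudly, yet the consumer sees a 10-character string with embedded NULs rather than "hello".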