On Thu, 29 Sep 2022 15:19:59 -0400
Larry White <ljw1...@gmail.com> wrote:
> Interesting. This doesn't seem to be a Java issue, per se then. I've seen
> admonitions in various Arrow Java threads to always specify the Charset for
> the conversion - and so assumed more than one Charset was legal - and have
> written Arrow Java test code that uses other charsets without ill effect.
> 
> I've never attempted to transport that data over the wire or export it
> using the C-Data Interface, however. It seems like that's where it would
> fall down.

For performance reasons, most consumers of Arrow data do not necessarily
check that it is valid utf-8. They would, however, definitely misinterpret
any data that was written in another encoding.
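
To make that concrete, here is a minimal sketch in plain Java (standard
library only, no Arrow classes) of what a consumer sees when utf-16 bytes
are handed over as if they were utf-8. The class name `Utf8Check` and the
helper `isValidUtf8` are hypothetical names chosen for illustration; they
are not part of any Arrow API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Hypothetical helper: true only if the bytes are well-formed UTF-8.
    // A strict decoder is used so malformed input is reported, not replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Non-ASCII text: the UTF-16LE bytes are not well-formed UTF-8.
        String accented = "h\u00e9llo";
        System.out.println(isValidUtf8(accented.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(accented.getBytes(StandardCharsets.UTF_16LE)));

        // Pure ASCII text: the UTF-16LE bytes pass validation, yet a reader
        // that assumes UTF-8 gets a NUL after every letter (10 chars, not 5).
        byte[] asciiUtf16 = "hello".getBytes(StandardCharsets.UTF_16LE);
        String misread = new String(asciiUtf16, StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(asciiUtf16) + " " + misread.length());
    }
}
```

The second case is the awkward one: even a consumer that did validate
would accept the bytes and still read the wrong string.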

The "string" (also called "utf8" in some implementations) data type is
definitely specified as being valid utf-8.

Given the dwindling popularity of utf-16 and the growing universality
of utf-8, I don't think it would be a good idea to add another datatype
for it. However, an extension type would be doable.

I think a step back is needed first: what is the use case for
transporting utf-16 data in Arrow?

Regards

Antoine.



> 
> On Thu, Sep 29, 2022 at 3:01 PM James Henderson <j...@juxt.pro> wrote:
> 
> > FWIW we'd made a similar assumption. In Schema.fbs [1] the type is called
> > Utf8, as well as the Java `ArrowType.Utf8` class - is this a required
> > assumption to work with other language Arrow libs, maybe?
> >
> > James
> >
> > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs
> >
> > On Thu, 29 Sept 2022 at 18:57, Larry White <ljw1...@gmail.com> wrote:
> >  
> > > Hi Kevin,
> > >
> > > I don't know of any particular restriction regarding string encoding.
> > > VarCharVector stores data as a byte array, and the encoding can be set
> > > using the Charset class when you convert Strings to and from bytes. Since
> > > Java strings use UTF-16 internally, I would expect this to 'just work'.
> > >
> > > larry
> > >
> > > On Thu, Sep 29, 2022 at 12:46 PM Kevin Bambrick <kevinbambri...@gmail.com> wrote:
> > >  
> > > > Hi.
> > > >
> > > > Was just wondering was support for UTF-16 Strings considered? As far as I
> > > > am aware VarChar vectors only support UTF-8. Are they something that may be
> > > > supported in the future?
> > > >
> > > > Regards.
> > > > Kevin.
> > > >  
> > >  
> >
> >
> > --
> > *James Henderson*
> > XTDB Development Manager at *JUXT*
> >
> > Email j...@juxt.pro
> > Website https://juxt.pro
> >
> >  
> 


