pyarrow at least treats the KeyValue values as binary and not UTF-8.

On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I think other languages (e.g. java, python) might make more of distinction 
> between utf-8 compatible strings and raw bytes.  For python it might be less 
> of a concern if the c++ wrapper already makes the value field look like a 
> bytes field
>
> On Sunday, July 11, 2021, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> We could certainly "upgrade" KeyValue to have a binary value field
>> everywhere KeyValue is used, but there is some risk of code in the
>> wild expecting there to be a null terminator after the string data.
>> The Flatbuffers-generated accessor APIs do not depend on the existence
>> of the null terminator, though. Not ideal, but I would not be thrilled
>> about adding an extra [ BinaryKeyValue ] everyplace we currently have
>> [ KeyValue ].
>>
>> That said, I doubt that we have any endogenous forward compatibility
>> problems related to this in Apache Arrow-maintained libraries, the
>> risk would come from users who are interacting with the Flatbuffers
>> data manually / without using one of our libraries. We could implement
>> the changes and run a set of forward compatibility integration tests
>> to see if anyone of our released libraries have an issue.
>>
>> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
>> >
>> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <emkornfi...@gmail.com> 
>> > wrote:
>> > >
>> > > Retitling and forking the discussion to talk about key value pairs.
>> > >
>> > > What is the byte cost of an empty list?  Another option would be to
>> > > introduce a new BinaryKeyValue table and add binary metadata.
>> > >
>> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
>> > > natebauernfe...@deephaven.io> wrote:
>> > >
>> > > > Deephaven and I are very supportive of "upgrading" the value half of 
>> > > > the kv
>> > > > pair to a byte vector. What is the best way to find out if there is
>> > > > sufficient interest?
>> > > >
>> > > >
>> > > > I've been stewing on the ideas here around schema evolution, and I 
>> > > > realize
>> > > > the specific feature I am missing is the ability to encode that a field
>> > > > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
>> > > > empty/has-no-data in O(0) cost (yes; for free).
>> > > >
>> > > > Might there be interest in adding a "field_id" to the FieldNode (which 
>> > > > is
>> > > > encoded on the RecordBatch flatbuffer)? I see a simple 
>> > > > forward-compatible
>> > > > upgrade (by either keying off of 0, or explicitly set the field 
>> > > > default to
>> > > > -1) which would allow the sender to "skip" fields that have 1) 
>> > > > FieldNode
>> > > > length of zero, and 2) all Buffer's associated at that level (and 
>> > > > further
>> > > > nested) are also equally empty (i.e. Buffer length is zero).
>> > > >
>> > > > I understand this concept slightly interferes with RecordBatch's 
>> > > > `length`
>> > > > field, and that many implementations use that length to resize the
>> > > > root-level FieldNodes. The use-case I have in mind has different 
>> > > > logical
>> > > > lengths per field node; current implementations require sending a
>> > > > RecordBatch length of the max length across all root level field 
>> > > > nodes. I
>> > > > believe this requires a copy of data whenever a field node is too 
>> > > > short; I
>> > > > don't know if there is a decent solution to this slight inefficiency. 
>> > > > I am
>> > > > bringing it up because if "skipping a field node when it is empty" is a
>> > > > feature, then we may not want to allocate space for those nodes given 
>> > > > that
>> > > > the record batch length will likely be greater than zero.
>> > > >
>> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <wesmck...@gmail.com> 
>> > > > wrote:
>> > > >
>> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <apa...@lidavidm.me> wrote:
>> > > > > >
>> > > > > > From the Flatbuffers internals doc[1] it appears they are the same:
>> > > > > "Strings are simply a vector of bytes, and are always 
>> > > > > null-terminated."
>> > > > >
>> > > > > I see. I took a look at flatbuffers.h, and it appears that changing
>> > > > > this field from string to [byte] would be backward-compatible and
>> > > > > forward-compatible except with code that expects a null terminator.
>> > > > > This is something we could discuss separately if there were enough
>> > > > > interest.
>> > > > >
>> > > > > > [1]: 
>> > > > > > https://google.github.io/flatbuffers/flatbuffers_internals.html
>> > > > > >
>> > > > > > -David
>> > > > > >
>> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
>> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
>> > > > emkornfi...@gmail.com>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I 
>> > > > > > > > > think
>> > > > > schema
>> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
>> > > > > property of the
>> > > > > > > > > transport and not the stream format itself. If we are leaning
>> > > > > towards just
>> > > > > > > > > schema evolution then maybe it makes sense to discuss it for 
>> > > > > > > > > the
>> > > > > IPC stream
>> > > > > > > > > format and leverage that in Flight. I'd be interested in what
>> > > > > others think.
>> > > > > > > >
>> > > > > > > > I tend to agree, I think stream multiplexing is likely a 
>> > > > > > > > transport
>> > > > > level
>> > > > > > > > issue.  IMO I think schema evolution should be consistent with 
>> > > > > > > > the
>> > > > > IPC
>> > > > > > > > stream format  and flight.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > > Nate: it may be worth starting a separate discussion about 
>> > > > > > > > > more
>> > > > > general
>> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
>> > > > > metadata was
>> > > > > > > > > chosen/if opaque bytes were considered in the past.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I think  this was an unfortunate design of the key value 
>> > > > > > > > metadata
>> > > > in
>> > > > > > > > Schema.fbs, but I don't think I was around when this decision 
>> > > > > > > > was
>> > > > > made.
>> > > > > > >
>> > > > > > > I agree that it's unfortunate that we did not use [ byte ] 
>> > > > > > > instead of
>> > > > > > > string for the value in the KeyValue metadata — I think this was 
>> > > > > > > more
>> > > > > > > of an oversight than a deliberate choice (e.g. it was not our 
>> > > > > > > intent
>> > > > > > > to require binary data to be base64-encoded — this is something 
>> > > > > > > that
>> > > > > > > we have to do when encoding binary data in Thrift KeyValue 
>> > > > > > > metadata
>> > > > > > > for Parquet, for example). Is the binary representation of [byte]
>> > > > > > > different from string?
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > > Side Question: Why isn't the IPC stream format a series of the
>> > > > flight
>> > > > > > > > > protobufs?
>> > > > > > > >
>> > > > > > > > In addition to what David said, protobufs can't be read 
>> > > > > > > > directly
>> > > > > from a
>> > > > > > > > memory-mapped file (they need decoding).  This was one of the
>> > > > design
>> > > > > > > > considerations of using flatbuffers and IPC Stream/File format.
>> > > > > > > >
>> > > > > > > > I was thinking Micah's comment is more that whatever we do, it
>> > > > > should be
>> > > > > > > > > clearly specified and edge cases should be considered, 
>> > > > > > > > > especially
>> > > > > if we
>> > > > > > > > > might want to 'backport' this into the stream format later.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Yes, for dictionaries we just need to be careful to define
>> > > > semantics
>> > > > > and
>> > > > > > > > ensure implementations are validating them with regards to
>> > > > > dictionaries.
>> > > > > > > > There likely isn't any need to change current implementations
>> > > > though.
>> > > > > > > >
>> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <lidav...@apache.org>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I 
>> > > > > > > > > think
>> > > > > schema
>> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
>> > > > > property of the
>> > > > > > > > > transport and not the stream format itself. If we are leaning
>> > > > > towards just
>> > > > > > > > > schema evolution then maybe it makes sense to discuss it for 
>> > > > > > > > > the
>> > > > > IPC stream
>> > > > > > > > > format and leverage that in Flight. I'd be interested in what
>> > > > > others think.
>> > > > > > > > >
>> > > > > > > > > Especially if we are looking at multiplexing streams - I 
>> > > > > > > > > would
>> > > > > wonder if
>> > > > > > > > > that's actually better served by making it easier to 
>> > > > > > > > > implement
>> > > > > using the
>> > > > > > > > > Flight implementation as it stands (by managing concurrent 
>> > > > > > > > > RPC
>> > > > > calls and/or
>> > > > > > > > > performing the union-of-structs encoding trick for you), 
>> > > > > > > > > instead
>> > > > > of having
>> > > > > > > > > to change the protocol.
>> > > > > > > > >
>> > > > > > > > > Nate: it may be worth starting a separate discussion about 
>> > > > > > > > > more
>> > > > > general
>> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
>> > > > > metadata was
>> > > > > > > > > chosen/if opaque bytes were considered in the past. Off the 
>> > > > > > > > > top
>> > > > of
>> > > > > my head
>> > > > > > > > > if it's for on-disk storage and fully application-defined it 
>> > > > > > > > > may
>> > > > > make sense
>> > > > > > > > > to store as a separate file alongside the Arrow file 
>> > > > > > > > > (indexed by
>> > > > > record
>> > > > > > > > > batch index) where you can take advantage of whatever format 
>> > > > > > > > > is
>> > > > > most
>> > > > > > > > > suitable.
>> > > > > > > > >
>> > > > > > > > > -David
>> > > > > > > > >
>> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
>> > > > > > > > > > Hi guys,
>> > > > > > > > > >
>> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion 
>> > > > > > > > > > was
>> > > > to
>> > > > > add this
>> > > > > > > > > > feature starting from the IPC(I moved initial write up 
>> > > > > > > > > > steps to
>> > > > > the
>> > > > > > > > > bottom
>> > > > > > > > > > of the doc). Afterwards David suggested focusing on Flight 
>> > > > > > > > > > and
>> > > > > that's how
>> > > > > > > > > > we ended up with the protobufs change in the proposal. This
>> > > > > being said I
>> > > > > > > > > do
>> > > > > > > > > > think that the place where this should be impemented is a 
>> > > > > > > > > > good
>> > > > > question
>> > > > > > > > > on
>> > > > > > > > > > its own. Maybe it makes sense to have this kind of a 
>> > > > > > > > > > feature in
>> > > > > IPC and
>> > > > > > > > > > somehow use it in Flight, maybe not.
>> > > > > > > > > > 2. The point about dictionaries deserves a dedicated 
>> > > > > > > > > > section in
>> > > > > the
>> > > > > > > > > > proposal. Nate and David brought it up and shared some
>> > > > insights.
>> > > > > I'll try
>> > > > > > > > > > to aggregate them and we can continue the discussion form
>> > > > there.
>> > > > > > > > > >
>> > > > > > > > > > Cheers,
>> > > > > > > > > > Gosh
>> > > > > > > > > >
>> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
>> > > > > > > > > natebauernfe...@deephaven.io>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > > > makes it more difficult to bring schema evolution 
>> > > > > > > > > > > > > > back
>> > > > > into the
>> > > > > > > > > > > > > > IPC Stream format (i.e. it would live only in 
>> > > > > > > > > > > > > > flight)
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer structures 
>> > > > > > > > > > > > > not the
>> > > > > > > > > protobufs.
>> > > > > > > > > > > Can
>> > > > > > > > > > > > > you help me understand how difficult it would be to 
>> > > > > > > > > > > > > bring
>> > > > > the
>> > > > > > > > > > > `schema_id`
>> > > > > > > > > > > > > approach to the IPC stream format?
>> > > > > > > > > > > >
>> > > > > > > > > > > > I thought we were talking solely about the Flight 
>> > > > > > > > > > > > Protobuf
>> > > > > > > > > definitions -
>> > > > > > > > > > > > not the Flatbuffers (and the Google doc at least only 
>> > > > > > > > > > > > talks
>> > > > > about the
>> > > > > > > > > > > > Protobufs).
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I somehow missed that schema_id is being added to 
>> > > > > > > > > > > protobuf in
>> > > > > the
>> > > > > > > > > document.
>> > > > > > > > > > > It feels to me that the schema_id is a property that 
>> > > > > > > > > > > would
>> > > > > ideally only
>> > > > > > > > > > > apply to the RecordBatch. I better understand Micah's
>> > > > > dictionary
>> > > > > > > > > concerns,
>> > > > > > > > > > > now, too.
>> > > > > > > > > > >
>> > > > > > > > > > > > Side Question: Why isn't the IPC stream format a 
>> > > > > > > > > > > > series of
>> > > > > the flight
>> > > > > > > > > > > > > protobufs? It's a real shame that there is no 
>> > > > > > > > > > > > > standard
>> > > > way
>> > > > > to
>> > > > > > > > > > > > > capture/replay a stream with app_metadata. (Obviously
>> > > > > ignoring the
>> > > > > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
>> > > > > > > > > > > >
>> > > > > > > > > > > > The IPC format was defined long before Flight, and 
>> > > > > > > > > > > > Flight's
>> > > > > > > > > app_metadata
>> > > > > > > > > > > > was added after Flight's initial definition. Note an 
>> > > > > > > > > > > > IPC
>> > > > > message does
>> > > > > > > > > > > have
>> > > > > > > > > > > > a provision for key-value metadata, though I think 
>> > > > > > > > > > > > APIs for
>> > > > > that are
>> > > > > > > > > not
>> > > > > > > > > > > > fully exposed. (See ARROW-6940:
>> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
>> > > > > despite my
>> > > > > > > > > comments
>> > > > > > > > > > > > there perhaps we need to unify or at least consider how
>> > > > > Flight's
>> > > > > > > > > > > > app_metadata relates to the IPC message 
>> > > > > > > > > > > > custom_metadata.
>> > > > Also
>> > > > > > > > > perhaps see
>> > > > > > > > > > > > ARROW-1059.)
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > KeyValue unfortunately is string to string. In flatbuffer
>> > > > > strings are
>> > > > > > > > > only
>> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand 
>> > > > > > > > > > > is
>> > > > > opaque
>> > > > > > > > > bytes.
>> > > > > > > > > > > The latter is a bit more useful.
>> > > > > > > > > > >
>> > > > > > > > > > > --
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > >

Reply via email to