I agree with you any thoughts on a way forward for at least hardening the
spec (or should this be done at the same time as adding the new field)?

On Mon, Aug 16, 2021 at 1:45 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I've been poking around the project, and I'm growing concerned that
> our use of the KeyValue field has already been non-compliant in many
> cases since we do not validate UTF8-ness. Since we also use KeyValue
> to handle opaque data serialization for extension types [1], the fact
> that the specification does not clarify that binary data (such as the
> output of ExtensionType::Serialize) must be base64-encoded or similar
> makes this a bit of a minefield at the moment.
>
> It seems that there are no particularly excellent solutions, and
> maintaining the status quo (having now identified these
> inconsistencies / vagueries in at least the C++ implementation) is
> probably not a good idea either. In Parquet, where we store a
> serialized binary Arrow schema, we have to base64-encode [2]
>
> [1]:
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> [2]:
> https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L423
>
> On Tue, Aug 10, 2021 at 3:27 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Ah, that's definitely a no-go then (I believe we verify messages
> > unconditionally in C++). That's unfortunate (and I feel responsible
> > for missing this, but I suppose we had a lot of opportunities to fix
> > it prior to the 1.0.0 format version) — so to have actual binary
> > values (which was the intention in the first place for the metadata)
> > we would need to add a new metadata field.
> >
> > On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> > >
> > > One issue with changing it to byte is it would effectively break any
> reader that is validating flatbuffer data, because flatbuffers verifies
> null termination [1].  While this might comply with forward compatibility
> guarantees it seems like a pretty large blast radius.
> > >
> > > [1]
> https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457
> > >
> > > On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >>
> > >> pyarrow at least treats the KeyValue values as binary and not UTF-8.
> > >>
> > >> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <
> emkornfi...@gmail.com> wrote:
> > >> >
> > >> > I think other languages (e.g. java, python) might make more of
> distinction between utf-8 compatible strings and raw bytes.  For python it
> might be less of a concern if the c++ wrapper already makes the value field
> look like a bytes field
> > >> >
> > >> > On Sunday, July 11, 2021, Wes McKinney <wesmck...@gmail.com> wrote:
> > >> >>
> > >> >> We could certainly "upgrade" KeyValue to have a binary value field
> > >> >> everywhere KeyValue is used, but there is some risk of code in the
> > >> >> wild expecting there to be a null terminator after the string data.
> > >> >> The Flatbuffers-generated accessor APIs do not depend on the
> existence
> > >> >> of the null terminator, though. Not ideal, but I would not be
> thrilled
> > >> >> about adding an extra [ BinaryKeyValue ] everyplace we currently
> have
> > >> >> [ KeyValue ].
> > >> >>
> > >> >> That said, I doubt that we have any endogenous forward
> compatibility
> > >> >> problems related to this in Apache Arrow-maintained libraries, the
> > >> >> risk would come from users who are interacting with the Flatbuffers
> > >> >> data manually / without using one of our libraries. We could
> implement
> > >> >> the changes and run a set of forward compatibility integration
> tests
> > >> >> to see if anyone of our released libraries have an issue.
> > >> >>
> > >> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >> >> >
> > >> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
> > >> >> >
> > >> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <
> emkornfi...@gmail.com> wrote:
> > >> >> > >
> > >> >> > > Retitling and forking the discussion to talk about key value
> pairs.
> > >> >> > >
> > >> >> > > What is the byte cost of an empty list?  Another option would
> be to
> > >> >> > > introduce a new BinaryKeyValue table and add binary metadata.
> > >> >> > >
> > >> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> > >> >> > > natebauernfe...@deephaven.io> wrote:
> > >> >> > >
> > >> >> > > > Deephaven and I are very supportive of "upgrading" the value
> half of the kv
> > >> >> > > > pair to a byte vector. What is the best way to find out if
> there is
> > >> >> > > > sufficient interest?
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > I've been stewing on the ideas here around schema evolution,
> and I realize
> > >> >> > > > the specific feature I am missing is the ability to encode
> that a field
> > >> >> > > > (i.e. its FieldNode and accompanying Buffers in the
> RecordBatch) is
> > >> >> > > > empty/has-no-data in O(0) cost (yes; for free).
> > >> >> > > >
> > >> >> > > > Might there be interest in adding a "field_id" to the
> FieldNode (which is
> > >> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple
> forward-compatible
> > >> >> > > > upgrade (by either keying off of 0, or explicitly set the
> field default to
> > >> >> > > > -1) which would allow the sender to "skip" fields that have
> 1) FieldNode
> > >> >> > > > length of zero, and 2) all Buffer's associated at that level
> (and further
> > >> >> > > > nested) are also equally empty (i.e. Buffer length is zero).
> > >> >> > > >
> > >> >> > > > I understand this concept slightly interferes with
> RecordBatch's `length`
> > >> >> > > > field, and that many implementations use that length to
> resize the
> > >> >> > > > root-level FieldNodes. The use-case I have in mind has
> different logical
> > >> >> > > > lengths per field node; current implementations require
> sending a
> > >> >> > > > RecordBatch length of the max length across all root level
> field nodes. I
> > >> >> > > > believe this requires a copy of data whenever a field node
> is too short; I
> > >> >> > > > don't know if there is a decent solution to this slight
> inefficiency. I am
> > >> >> > > > bringing it up because if "skipping a field node when it is
> empty" is a
> > >> >> > > > feature, then we may not want to allocate space for those
> nodes given that
> > >> >> > > > the record batch length will likely be greater than zero.
> > >> >> > > >
> > >> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <
> wesmck...@gmail.com> wrote:
> > >> >> > > >
> > >> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <
> apa...@lidavidm.me> wrote:
> > >> >> > > > > >
> > >> >> > > > > > From the Flatbuffers internals doc[1] it appears they
> are the same:
> > >> >> > > > > "Strings are simply a vector of bytes, and are always
> null-terminated."
> > >> >> > > > >
> > >> >> > > > > I see. I took a look at flatbuffers.h, and it appears that
> changing
> > >> >> > > > > this field from string to [byte] would be
> backward-compatible and
> > >> >> > > > > forward-compatible except with code that expects a null
> terminator.
> > >> >> > > > > This is something we could discuss separately if there
> were enough
> > >> >> > > > > interest.
> > >> >> > > > >
> > >> >> > > > > > [1]:
> https://google.github.io/flatbuffers/flatbuffers_internals.html
> > >> >> > > > > >
> > >> >> > > > > > -David
> > >> >> > > > > >
> > >> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > >> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> > >> >> > > > emkornfi...@gmail.com>
> > >> >> > > > > wrote:
> > >> >> > > > > > > >
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > Right, I had wanted to focus the discussion on
> Flight as I think
> > >> >> > > > > schema
> > >> >> > > > > > > > > evolution or multiplexing streams (more so the
> latter) is a
> > >> >> > > > > property of the
> > >> >> > > > > > > > > transport and not the stream format itself. If we
> are leaning
> > >> >> > > > > towards just
> > >> >> > > > > > > > > schema evolution then maybe it makes sense to
> discuss it for the
> > >> >> > > > > IPC stream
> > >> >> > > > > > > > > format and leverage that in Flight. I'd be
> interested in what
> > >> >> > > > > others think.
> > >> >> > > > > > > >
> > >> >> > > > > > > > I tend to agree, I think stream multiplexing is
> likely a transport
> > >> >> > > > > level
> > >> >> > > > > > > > issue.  IMO I think schema evolution should be
> consistent with the
> > >> >> > > > > IPC
> > >> >> > > > > > > > stream format  and flight.
> > >> >> > > > > > > >
> > >> >> > > > > > > >
> > >> >> > > > > > > > > Nate: it may be worth starting a separate
> discussion about more
> > >> >> > > > > general
> > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> key-value
> > >> >> > > > > metadata was
> > >> >> > > > > > > > > chosen/if opaque bytes were considered in the past.
> > >> >> > > > > > > >
> > >> >> > > > > > > >
> > >> >> > > > > > > > I think  this was an unfortunate design of the key
> value metadata
> > >> >> > > > in
> > >> >> > > > > > > > Schema.fbs, but I don't think I was around when this
> decision was
> > >> >> > > > > made.
> > >> >> > > > > > >
> > >> >> > > > > > > I agree that it's unfortunate that we did not use [
> byte ] instead of
> > >> >> > > > > > > string for the value in the KeyValue metadata — I
> think this was more
> > >> >> > > > > > > of an oversight than a deliberate choice (e.g. it was
> not our intent
> > >> >> > > > > > > to require binary data to be base64-encoded — this is
> something that
> > >> >> > > > > > > we have to do when encoding binary data in Thrift
> KeyValue metadata
> > >> >> > > > > > > for Parquet, for example). Is the binary
> representation of [byte]
> > >> >> > > > > > > different from string?
> > >> >> > > > > > >
> > >> >> > > > > > >
> > >> >> > > > > > >
> > >> >> > > > > > > > Side Question: Why isn't the IPC stream format a
> series of the
> > >> >> > > > flight
> > >> >> > > > > > > > > protobufs?
> > >> >> > > > > > > >
> > >> >> > > > > > > > In addition to what David said, protobufs can't be
> read directly
> > >> >> > > > > from a
> > >> >> > > > > > > > memory-mapped file (they need decoding).  This was
> one of the
> > >> >> > > > design
> > >> >> > > > > > > > considerations of using flatbuffers and IPC
> Stream/File format.
> > >> >> > > > > > > >
> > >> >> > > > > > > > I was thinking Micah's comment is more that whatever
> we do, it
> > >> >> > > > > should be
> > >> >> > > > > > > > > clearly specified and edge cases should be
> considered, especially
> > >> >> > > > > if we
> > >> >> > > > > > > > > might want to 'backport' this into the stream
> format later.
> > >> >> > > > > > > >
> > >> >> > > > > > > >
> > >> >> > > > > > > > Yes, for dictionaries we just need to be careful to
> define
> > >> >> > > > semantics
> > >> >> > > > > and
> > >> >> > > > > > > > ensure implementations are validating them with
> regards to
> > >> >> > > > > dictionaries.
> > >> >> > > > > > > > There likely isn't any need to change current
> implementations
> > >> >> > > > though.
> > >> >> > > > > > > >
> > >> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <
> lidav...@apache.org>
> > >> >> > > > > wrote:
> > >> >> > > > > > > >
> > >> >> > > > > > > > > Right, I had wanted to focus the discussion on
> Flight as I think
> > >> >> > > > > schema
> > >> >> > > > > > > > > evolution or multiplexing streams (more so the
> latter) is a
> > >> >> > > > > property of the
> > >> >> > > > > > > > > transport and not the stream format itself. If we
> are leaning
> > >> >> > > > > towards just
> > >> >> > > > > > > > > schema evolution then maybe it makes sense to
> discuss it for the
> > >> >> > > > > IPC stream
> > >> >> > > > > > > > > format and leverage that in Flight. I'd be
> interested in what
> > >> >> > > > > others think.
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > Especially if we are looking at multiplexing
> streams - I would
> > >> >> > > > > wonder if
> > >> >> > > > > > > > > that's actually better served by making it easier
> to implement
> > >> >> > > > > using the
> > >> >> > > > > > > > > Flight implementation as it stands (by managing
> concurrent RPC
> > >> >> > > > > calls and/or
> > >> >> > > > > > > > > performing the union-of-structs encoding trick for
> you), instead
> > >> >> > > > > of having
> > >> >> > > > > > > > > to change the protocol.
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > Nate: it may be worth starting a separate
> discussion about more
> > >> >> > > > > general
> > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> key-value
> > >> >> > > > > metadata was
> > >> >> > > > > > > > > chosen/if opaque bytes were considered in the
> past. Off the top
> > >> >> > > > of
> > >> >> > > > > my head
> > >> >> > > > > > > > > if it's for on-disk storage and fully
> application-defined it may
> > >> >> > > > > make sense
> > >> >> > > > > > > > > to store as a separate file alongside the Arrow
> file (indexed by
> > >> >> > > > > record
> > >> >> > > > > > > > > batch index) where you can take advantage of
> whatever format is
> > >> >> > > > > most
> > >> >> > > > > > > > > suitable.
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > -David
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan
> wrote:
> > >> >> > > > > > > > > > Hi guys,
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial
> suggestion was
> > >> >> > > > to
> > >> >> > > > > add this
> > >> >> > > > > > > > > > feature starting from the IPC(I moved initial
> write up steps to
> > >> >> > > > > the
> > >> >> > > > > > > > > bottom
> > >> >> > > > > > > > > > of the doc). Afterwards David suggested focusing
> on Flight and
> > >> >> > > > > that's how
> > >> >> > > > > > > > > > we ended up with the protobufs change in the
> proposal. This
> > >> >> > > > > being said I
> > >> >> > > > > > > > > do
> > >> >> > > > > > > > > > think that the place where this should be
> impemented is a good
> > >> >> > > > > question
> > >> >> > > > > > > > > on
> > >> >> > > > > > > > > > its own. Maybe it makes sense to have this kind
> of a feature in
> > >> >> > > > > IPC and
> > >> >> > > > > > > > > > somehow use it in Flight, maybe not.
> > >> >> > > > > > > > > > 2. The point about dictionaries deserves a
> dedicated section in
> > >> >> > > > > the
> > >> >> > > > > > > > > > proposal. Nate and David brought it up and
> shared some
> > >> >> > > > insights.
> > >> >> > > > > I'll try
> > >> >> > > > > > > > > > to aggregate them and we can continue the
> discussion form
> > >> >> > > > there.
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > > > Cheers,
> > >> >> > > > > > > > > > Gosh
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > >> >> > > > > > > > > natebauernfe...@deephaven.io>
> > >> >> > > > > > > > > > wrote:
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > > > > > makes it more difficult to bring schema
> evolution back
> > >> >> > > > > into the
> > >> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live
> only in flight)
> > >> >> > > > > > > > > > > > >
> > >> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer
> structures not the
> > >> >> > > > > > > > > protobufs.
> > >> >> > > > > > > > > > > Can
> > >> >> > > > > > > > > > > > > you help me understand how difficult it
> would be to bring
> > >> >> > > > > the
> > >> >> > > > > > > > > > > `schema_id`
> > >> >> > > > > > > > > > > > > approach to the IPC stream format?
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > > > I thought we were talking solely about the
> Flight Protobuf
> > >> >> > > > > > > > > definitions -
> > >> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at
> least only talks
> > >> >> > > > > about the
> > >> >> > > > > > > > > > > > Protobufs).
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > > > I somehow missed that schema_id is being added
> to protobuf in
> > >> >> > > > > the
> > >> >> > > > > > > > > document.
> > >> >> > > > > > > > > > > It feels to me that the schema_id is a
> property that would
> > >> >> > > > > ideally only
> > >> >> > > > > > > > > > > apply to the RecordBatch. I better understand
> Micah's
> > >> >> > > > > dictionary
> > >> >> > > > > > > > > concerns,
> > >> >> > > > > > > > > > > now, too.
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream
> format a series of
> > >> >> > > > > the flight
> > >> >> > > > > > > > > > > > > protobufs? It's a real shame that there is
> no standard
> > >> >> > > > way
> > >> >> > > > > to
> > >> >> > > > > > > > > > > > > capture/replay a stream with app_metadata.
> (Obviously
> > >> >> > > > > ignoring the
> > >> >> > > > > > > > > > > > > annoyances around protobuf wrapping
> flatbuffers.)
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > > > The IPC format was defined long before
> Flight, and Flight's
> > >> >> > > > > > > > > app_metadata
> > >> >> > > > > > > > > > > > was added after Flight's initial definition.
> Note an IPC
> > >> >> > > > > message does
> > >> >> > > > > > > > > > > have
> > >> >> > > > > > > > > > > > a provision for key-value metadata, though I
> think APIs for
> > >> >> > > > > that are
> > >> >> > > > > > > > > not
> > >> >> > > > > > > > > > > > fully exposed. (See ARROW-6940:
> > >> >> > > > > > > > > > > >
> https://issues.apache.org/jira/browse/ARROW-6940 and
> > >> >> > > > > despite my
> > >> >> > > > > > > > > comments
> > >> >> > > > > > > > > > > > there perhaps we need to unify or at least
> consider how
> > >> >> > > > > Flight's
> > >> >> > > > > > > > > > > > app_metadata relates to the IPC message
> custom_metadata.
> > >> >> > > > Also
> > >> >> > > > > > > > > perhaps see
> > >> >> > > > > > > > > > > > ARROW-1059.)
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > > > KeyValue unfortunately is string to string. In
> flatbuffer
> > >> >> > > > > strings are
> > >> >> > > > > > > > > only
> > >> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the
> other hand is
> > >> >> > > > > opaque
> > >> >> > > > > > > > > bytes.
> > >> >> > > > > > > > > > > The latter is a bit more useful.
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > > > --
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > >
> > >> >> > > > > > >
> > >> >> > > > >
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > --
> > >> >> > > >
>

Reply via email to