Retitling and forking the discussion to talk about key value pairs.

What is the byte cost of an empty list?  Another option would be to
introduce a new BinaryKeyValue table and add binary metadata.

On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
natebauernfe...@deephaven.io> wrote:

> Deephaven and I are very supportive of "upgrading" the value half of the kv
> pair to a byte vector. What is the best way to find out if there is
> sufficient interest?
>
>
> I've been stewing on the ideas here around schema evolution, and I realize
> the specific feature I am missing is the ability to encode that a field
> (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
> empty/has-no-data in O(0) cost (yes; for free).
>
> Might there be interest in adding a "field_id" to the FieldNode (which is
> encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
> upgrade (by either keying off of 0, or explicitly set the field default to
> -1) which would allow the sender to "skip" fields that have 1) FieldNode
> length of zero, and 2) all Buffer's associated at that level (and further
> nested) are also equally empty (i.e. Buffer length is zero).
>
> I understand this concept slightly interferes with RecordBatch's `length`
> field, and that many implementations use that length to resize the
> root-level FieldNodes. The use-case I have in mind has different logical
> lengths per field node; current implementations require sending a
> RecordBatch length of the max length across all root level field nodes. I
> believe this requires a copy of data whenever a field node is too short; I
> don't know if there is a decent solution to this slight inefficiency. I am
> bringing it up because if "skipping a field node when it is empty" is a
> feature, then we may not want to allocate space for those nodes given that
> the record batch length will likely be greater than zero.
>
> On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Wed, Jul 7, 2021 at 2:53 PM David Li <apa...@lidavidm.me> wrote:
> > >
> > > From the Flatbuffers internals doc[1] it appears they are the same:
> > "Strings are simply a vector of bytes, and are always null-terminated."
> >
> > I see. I took a look at flatbuffers.h, and it appears that changing
> > this field from string to [byte] would be backward-compatible and
> > forward-compatible except with code that expects a null terminator.
> > This is something we could discuss separately if there were enough
> > interest.
> >
> > > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
> > >
> > > -David
> > >
> > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > wrote:
> > > > >
> > > > > >
> > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > schema
> > > > > > evolution or multiplexing streams (more so the latter) is a
> > property of the
> > > > > > transport and not the stream format itself. If we are leaning
> > towards just
> > > > > > schema evolution then maybe it makes sense to discuss it for the
> > IPC stream
> > > > > > format and leverage that in Flight. I'd be interested in what
> > others think.
> > > > >
> > > > > I tend to agree, I think stream multiplexing is likely a transport
> > level
> > > > > issue.  IMO I think schema evolution should be consistent with the
> > IPC
> > > > > stream format  and flight.
> > > > >
> > > > >
> > > > > > Nate: it may be worth starting a separate discussion about more
> > general
> > > > > > metadata in the IPC message. I'm not aware of why key-value
> > metadata was
> > > > > > chosen/if opaque bytes were considered in the past.
> > > > >
> > > > >
> > > > > I think  this was an unfortunate design of the key value metadata
> in
> > > > > Schema.fbs, but I don't think I was around when this decision was
> > made.
> > > >
> > > > I agree that it's unfortunate that we did not use [ byte ] instead of
> > > > string for the value in the KeyValue metadata — I think this was more
> > > > of an oversight than a deliberate choice (e.g. it was not our intent
> > > > to require binary data to be base64-encoded — this is something that
> > > > we have to do when encoding binary data in Thrift KeyValue metadata
> > > > for Parquet, for example). Is the binary representation of [byte]
> > > > different from string?
> > > >
> > > >
> > > >
> > > > > Side Question: Why isn't the IPC stream format a series of the
> flight
> > > > > > protobufs?
> > > > >
> > > > > In addition to what David said, protobufs can't be read directly
> > from a
> > > > > memory-mapped file (they need decoding).  This was one of the
> design
> > > > > considerations of using flatbuffers and IPC Stream/File format.
> > > > >
> > > > > I was thinking Micah's comment is more that whatever we do, it
> > should be
> > > > > > clearly specified and edge cases should be considered, especially
> > if we
> > > > > > might want to 'backport' this into the stream format later.
> > > > >
> > > > >
> > > > > Yes, for dictionaries we just need to be careful to define
> semantics
> > and
> > > > > ensure implementations are validating them with regards to
> > dictionaries.
> > > > > There likely isn't any need to change current implementations
> though.
> > > > >
> > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <lidav...@apache.org>
> > wrote:
> > > > >
> > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > schema
> > > > > > evolution or multiplexing streams (more so the latter) is a
> > property of the
> > > > > > transport and not the stream format itself. If we are leaning
> > towards just
> > > > > > schema evolution then maybe it makes sense to discuss it for the
> > IPC stream
> > > > > > format and leverage that in Flight. I'd be interested in what
> > others think.
> > > > > >
> > > > > > Especially if we are looking at multiplexing streams - I would
> > wonder if
> > > > > > that's actually better served by making it easier to implement
> > using the
> > > > > > Flight implementation as it stands (by managing concurrent RPC
> > calls and/or
> > > > > > performing the union-of-structs encoding trick for you), instead
> > of having
> > > > > > to change the protocol.
> > > > > >
> > > > > > Nate: it may be worth starting a separate discussion about more
> > general
> > > > > > metadata in the IPC message. I'm not aware of why key-value
> > metadata was
> > > > > > chosen/if opaque bytes were considered in the past. Off the top
> of
> > my head
> > > > > > if it's for on-disk storage and fully application-defined it may
> > make sense
> > > > > > to store as a separate file alongside the Arrow file (indexed by
> > record
> > > > > > batch index) where you can take advantage of whatever format is
> > most
> > > > > > suitable.
> > > > > >
> > > > > > -David
> > > > > >
> > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion was
> to
> > add this
> > > > > > > feature starting from the IPC(I moved initial write up steps to
> > the
> > > > > > bottom
> > > > > > > of the doc). Afterwards David suggested focusing on Flight and
> > that's how
> > > > > > > we ended up with the protobufs change in the proposal. This
> > being said I
> > > > > > do
> > > > > > > think that the place where this should be impemented is a good
> > question
> > > > > > on
> > > > > > > its own. Maybe it makes sense to have this kind of a feature in
> > IPC and
> > > > > > > somehow use it in Flight, maybe not.
> > > > > > > 2. The point about dictionaries deserves a dedicated section in
> > the
> > > > > > > proposal. Nate and David brought it up and shared some
> insights.
> > I'll try
> > > > > > > to aggregate them and we can continue the discussion form
> there.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gosh
> > > > > > >
> > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > > > > > natebauernfe...@deephaven.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > makes it more difficult to bring schema evolution back
> > into the
> > > > > > > > > > > IPC Stream format (i.e. it would live only in flight)
> > > > > > > > > >
> > > > > > > > > > Gosh's proposal extends the flatbuffer structures not the
> > > > > > protobufs.
> > > > > > > > Can
> > > > > > > > > > you help me understand how difficult it would be to bring
> > the
> > > > > > > > `schema_id`
> > > > > > > > > > approach to the IPC stream format?
> > > > > > > > >
> > > > > > > > > I thought we were talking solely about the Flight Protobuf
> > > > > > definitions -
> > > > > > > > > not the Flatbuffers (and the Google doc at least only talks
> > about the
> > > > > > > > > Protobufs).
> > > > > > > > >
> > > > > > > >
> > > > > > > > I somehow missed that schema_id is being added to protobuf in
> > the
> > > > > > document.
> > > > > > > > It feels to me that the schema_id is a property that would
> > ideally only
> > > > > > > > apply to the RecordBatch. I better understand Micah's
> > dictionary
> > > > > > concerns,
> > > > > > > > now, too.
> > > > > > > >
> > > > > > > > > Side Question: Why isn't the IPC stream format a series of
> > the flight
> > > > > > > > > > protobufs? It's a real shame that there is no standard
> way
> > to
> > > > > > > > > > capture/replay a stream with app_metadata. (Obviously
> > ignoring the
> > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
> > > > > > > > >
> > > > > > > > > The IPC format was defined long before Flight, and Flight's
> > > > > > app_metadata
> > > > > > > > > was added after Flight's initial definition. Note an IPC
> > message does
> > > > > > > > have
> > > > > > > > > a provision for key-value metadata, though I think APIs for
> > that are
> > > > > > not
> > > > > > > > > fully exposed. (See ARROW-6940:
> > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
> > despite my
> > > > > > comments
> > > > > > > > > there perhaps we need to unify or at least consider how
> > Flight's
> > > > > > > > > app_metadata relates to the IPC message custom_metadata.
> Also
> > > > > > perhaps see
> > > > > > > > > ARROW-1059.)
> > > > > > > > >
> > > > > > > >
> > > > > > > > KeyValue unfortunately is string to string. In flatbuffer
> > strings are
> > > > > > only
> > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand is
> > opaque
> > > > > > bytes.
> > > > > > > > The latter is a bit more useful.
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> >
>
>
> --
>

Reply via email to