Sorry, I just took a closer look and left some comments. I think the one
substantive issue is that the linked document talks about columns of
different lengths in the Bag, but this isn't mentioned in the flatbuffers.
Could you comment on or update the documentation in the flatbuffers accordingly?

Thanks,
Micah

On Tue, Nov 23, 2021 at 10:41 AM David Li <lidav...@apache.org> wrote:

> Thanks for putting that up.
>
> It doesn't look like there's been too much discussion here. If people
> agree it's useful, maybe the next step is to draft an implementation in
> Java or C++ for feedback? There was some discussion on the use cases in the
> document; do we feel like we need to clarify that better?
>
> -David
>
> On Mon, Nov 8, 2021, at 16:46, Nate Bauernfeind wrote:
> > I put the draft up here: https://github.com/apache/arrow/pull/11646
> >
> > Thanks.
> >
> > On Mon, Nov 8, 2021 at 1:57 PM David Li <lidav...@apache.org> wrote:
> >
> > > Hey Nate,
> > >
> > > Thanks for doing this! Would you be interested in putting that commit up
> > > as a draft PR for discussion? I think we can discuss there.
> > >
> > > I'm not sure anyone is actively working on RLE or other encoding schemes
> > > at the moment.
> > >
> > > -David
> > >
> > > On Mon, Nov 8, 2021, at 13:19, Nate Bauernfeind wrote:
> > > > I've written up the ColumnBag proposal addressing items 1 and 2 on the
> > > > list. I'm open to any and all feedback/suggestions.
> > > >
> > > > I'd be happy to add item 3 (binary metadata) to the proposed change set.
> > > > Let me know if you want me to whip up the initial suggestion for that
> > > > version (and whether or not to keep it separate from ColumnBag).
> > > >
> > > > Would RLE related efforts change the structure of RecordBatch or
> > > > ColumnBag (if accepted)?
> > > >
> > > > Here is the brief history-discussion around why ColumnBag:
> > > > https://docs.google.com/document/d/1jsmmqLTyJkU8fx0sUGIqd6yu72N4v9uHFsuGSgB_DfE/
> > > >
> > > > Here is a brief commit doctoring up the flatbuffer to support this
> > > > version of the proposed change:
> > > > https://github.com/nbauernfeind/arrow/tree/column_bag_demo_v1
> > > >
> > > > I don't know if it's better to comment in the document or bring comments
> > > > back to the list. If it ends up being document heavy, then I'll summarize
> > > > the main points back on the list.
> > > >
> > > > I think I'll get started on a Java impl just to learn more even if it
> > > > ends up being extra work.
> > > >
> > > > Looking forward to your feedback,
> > > > Nate
> > > >
> > > > On Mon, Aug 9, 2021 at 10:06 PM Micah Kornfield <emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > > > I'm still interested in RLE related effort, but not sure about my
> > > > > available bandwidth (which is why I haven't made more of an effort
> > > > > there).
> > > > >
> > > > > On Tue, Aug 3, 2021 at 6:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > > Another Flatbuffers/Message.fbs project we should rekindle soon, in
> > > > > > addition to the schema evolution/replacement question which has been
> > > > > > raised with Flight, is that of sparse/compressed data (e.g. RLE). I
> > > > > > have a vacation plus some travel coming up so won't be able to devote
> > > > > > meaningful attention to this until the last part of August, but would
> > > > > > like to help it move forward.
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 27, 2021 at 1:40 PM David Li <lidav...@apache.org> wrote:
> > > > > > >
> > > > > > > Hey Nate,
> > > > > > >
> > > > > > > For the first two points, semantically I'm tempted to think of it
> > > > > > > more like the ability to send a "bag of columns" according to some
> > > > > > > schema (and hence columns could have differing lengths or even be
> > > > > > > absent). This could be a new structure alongside a record batch,
> > > > > > > which is semantically like a "slice of a table" (and hence
> > > > > > > rectangular and complete), instead of exposing existing users of
> > > > > > > RecordBatch to rather different behavior.
> > > > > > >
> > > > > > > For #3, a different thread was discussing some of the points there -
> > > > > > > it sounds like it may be possible to relax from map<string, string>
> > > > > > > to map<string, binary>.
> > > > > > >
> > > > > > > -David
> > > > > > >
> > > > > > > On Mon, Jul 26, 2021, at 11:01, Nate Bauernfeind wrote:
> > > > > > > > Wes suggested that maybe there are enough new ideas that it may make
> > > > > > > > sense to evolve past the existing structures rather than to bolt on
> > > > > > > > new functionality. I would like to learn what requirements exist
> > > > > > > > should new structures be adopted, and if applicable, would like to
> > > > > > > > turn this into a full POC proposal.
> > > > > > > >
> > > > > > > > These are the features that I feel are missing from the existing
> > > > > > > > design:
> > > > > > > > - the ability to notify that the columns are not consistent in length
> > > > > > > > (e.g. setting RecordBatch.length to -1; and give the arrow/flight
> > > > > > > > user the true FieldNode lengths).
> > > > > > > > - the ability to skip top-level field nodes that have length 0 at a
> > > > > > > > small cost (such as in a bitset)
> > > > > > > > - the ability to embed binary payload in the Message flatbuffer
> > > > > > > > wrapper (instead of String payload only)
> > > > > > > > - the ability to concurrently use more than one schema (the most
> > > > > > > > likely API will look like how one identifies a dictionary. ideally
> > > > > > > > dictionaries could be shared across field nodes in a schema or across
> > > > > > > > schemas in the same flight)
> > > > > > > >
> > > > > > > > What other features, or improvements, could/should be considered? Any
> > > > > > > > strong opinions against the ideas above? (Remember that a goal of
> > > > > > > > mine is to be able to send a RecordBatch of rows that were modified,
> > > > > > > > intersected only by the field-nodes that have changed (including
> > > > > > > > those with only inner node changes); thus the columns are a subset of
> > > > > > > > the full schema and the length of each node is independent of the
> > > > > > > > others).
> > > > > > > >
> > > > > > > > On Fri, Jul 9, 2021 at 9:26 AM Wes McKinney <wesmck...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > It sounds like we may want to discuss some potential evolutions of
> > > > > > > > > the Arrow binary protocol (for example: new Message types).
> > > > > > > > > Certainly a can of worms but rather than trying to bolt some new
> > > > > > > > > functionality onto the existing structures, it might be better to
> > > > > > > > > support the new use cases through some new structures which will
> > > > > > > > > be more clear cut from a forward compatibility standpoint.
> > > > > > > >
> > > > > > > > Nate