+1 to a vote (I can kick it off today). I had planned to respond earlier this week, but jet lag and e-mail backlog got the better of me.
On Thu, Apr 25, 2019 at 3:09 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Voting sounds like a good idea to me.
>
> Regards
>
> Antoine.
>
> On 25/04/2019 at 06:16, Micah Kornfield wrote:
> > Given that conversation seems to have died down on this, would it make
> > sense to hold a vote on allowing the large variable-width types to be
> > added? As discussed previously, PRs would need both C++ and Java
> > implementations before being merged.
> >
> > Could a PMC member facilitate this?
> >
> > Philipp, if approved, do you have bandwidth to finish up the PR for
> > LargeList?
> >
> > Thanks,
> > Micah
> >
> > On Mon, Apr 15, 2019 at 11:16 PM Philipp Moritz <pcmor...@gmail.com> wrote:
> >
> > > @Micah: I wanted to make it possible to support serializing large
> > > objects (existing large pandas dataframes with an "object" column, and
> > > also large Python types with the pyarrow serialization).
> > >
> > > On Mon, Apr 15, 2019 at 8:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > > To summarize my understanding of the thread so far, there seems to be
> > > > consensus on having a new distinct type for each "large" type.
> > > >
> > > > There are some reservations about the "large" types being harder to
> > > > support in algorithmic implementations.
> > > >
> > > > I'm curious, Philipp: was there a concrete use case that inspired you
> > > > to start the PR?
> > > >
> > > > Also, this was brought up on another thread, but the utility of the
> > > > "large" types might be limited in some languages (e.g. Java) until
> > > > they support buffer sizes larger than INT_MAX bytes. I brought this
> > > > up on the current PR to decouple Netty and memory management from
> > > > ArrowBuf [1], but the consensus seems to be to handle any
> > > > modifications in follow-up PRs (if they are agreed upon).
> > > >
> > > > Anything else people want to discuss before a vote on whether to
> > > > allow the additional types into the spec?
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/arrow/pull/4151
> > > >
> > > > On Monday, April 15, 2019, Jacques Nadeau <jacq...@apache.org> wrote:
> > > >
> > > > > > I am not Jacques, but I will try to give my own point of view on this.
> > > > >
> > > > > Thanks for making me laugh :)
> > > > >
> > > > > > I think that this is unavoidable. Even with batches, taking an
> > > > > > example of a binary column where the mean size of the payload is
> > > > > > 1 MB, it limits us to batches of 2048 elements. This can become
> > > > > > annoying pretty quickly.
> > > > >
> > > > > Good example. I'm not sure columnar matters, but I find it more
> > > > > useful than others.
> > > > >
> > > > > > logical types and physical types
> > > > >
> > > > > TL;DR: it is painful no matter which model you pick.
> > > > >
> > > > > We definitely worked hard to go a different way with Arrow than
> > > > > with Parquet. It was something I pushed consciously when we
> > > > > started, as I found some of the patterns in Parquet quite
> > > > > challenging. Unfortunately, we went too far in some places in the
> > > > > Java code, which tried to parallel the structure of the physical
> > > > > types directly (and thus the big refactor we did to reduce
> > > > > duplication last year -- props to Sidd, Bryan and the others who
> > > > > worked on that). I also think we probably lost as much as we
> > > > > gained with the current model.
> > > > >
> > > > > I agree with Antoine, both in his clean statement of the
> > > > > approaches and that sticking to the model we have today makes the
> > > > > most sense.
> > > > >
> > > > > On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for the clarification Antoine, very insightful.
> > > > > >
> > > > > > I'd also vote for keeping the existing model, for consistency.
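[Editor's note: the 2048-element figure in the batch-size example above follows directly from the 32-bit offset limit. A quick back-of-the-envelope check in plain Python; the 1 MB mean payload is taken from the message, not from the Arrow spec:]

```python
# With 32-bit signed offsets, a single variable-width (binary) array can
# address at most 2**31 - 1 bytes of value data.
INT32_OFFSET_LIMIT = 2**31 - 1

mean_payload = 2**20  # ~1 MB per binary value, as in the example above
max_elements = INT32_OFFSET_LIMIT // mean_payload

print(max_elements)  # 2047 -- i.e. roughly the 2048-element batches quoted
```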
> > > > > > On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <anto...@python.org> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am not Jacques, but I will try to give my own point of view on this.
> > > > > > >
> > > > > > > The distinction between logical and physical types can be
> > > > > > > modelled in two different ways:
> > > > > > >
> > > > > > > 1) A physical type can denote several logical types, but a
> > > > > > > logical type can only have a single physical representation.
> > > > > > > This is currently the Arrow model.
> > > > > > >
> > > > > > > 2) A physical type can denote several logical types, and a
> > > > > > > logical type can also be denoted by several physical types.
> > > > > > > This is the Parquet model.
> > > > > > >
> > > > > > > (Theoretically, there are two other possible models, but they
> > > > > > > are not very interesting to consider, since they don't seem to
> > > > > > > cater to concrete use cases.)
> > > > > > >
> > > > > > > Model 1 is obviously more restrictive, while model 2 is more
> > > > > > > flexible. Model 2 could be called "higher level"; you see
> > > > > > > something similar if you compare Python's and C++'s typing
> > > > > > > systems. On the other hand, model 1 provides a potentially
> > > > > > > simpler programming model for implementors of low-level
> > > > > > > kernels, as you can simply query the logical type of your data
> > > > > > > and you automatically know its physical type.
> > > > > > >
> > > > > > > The model chosen for Arrow is ingrained in its API. If we want
> > > > > > > to change the model, we'd better do it wholesale (implying
> > > > > > > probably a large refactoring and a significant number of
> > > > > > > unavoidable regressions) to avoid subjecting users to a
> > > > > > > confusing middle point.
> > > > > > >
> > > > > > > Also, as a side note, "convertibility" between different types
> > > > > > > can be a hairy subject... Having strict boundaries between
> > > > > > > types avoids being dragged into it too early.
> > > > > > > To return to the original subject: IMHO, LargeList (resp.
> > > > > > > LargeBinary) should be a distinct logical type from List
> > > > > > > (resp. Binary), the same way Int64 is a distinct logical type
> > > > > > > from Int32.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > > On 15/04/2019 at 18:45, Francois Saint-Jacques wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I would like to understand where we stand on logical types
> > > > > > > > and physical types. As I understand it, this proposal is
> > > > > > > > for the physical representation.
> > > > > > > >
> > > > > > > > In the context of an execution engine, the concept of
> > > > > > > > logical types becomes more important, as two physical
> > > > > > > > representations might have the same semantic values, e.g.
> > > > > > > > LargeList and List where all values fit in 32 bits. A more
> > > > > > > > complex example would be an Integer array and a dictionary
> > > > > > > > array whose values are integers.
> > > > > > > >
> > > > > > > > Is this only relevant for an execution engine? What about
> > > > > > > > the (C++) Array.Equals method and related comparison
> > > > > > > > methods? This also touches on the subject of type equality,
> > > > > > > > e.g. dictionaries with different but compatible encodings.
> > > > > > > >
> > > > > > > > Jacques, knowing that you worked on Parquet (which follows
> > > > > > > > this model) and Dremio, what is your opinion?
> > > > > > > > François
> > > > > > > >
> > > > > > > > Some related tickets:
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-554
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-1741
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-3144
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-4097
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-5052
> > > > > > > >
> > > > > > > > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > ARROW-4810 [1] and ARROW-750 [2] discuss adding types
> > > > > > > > > with 64-bit offsets to the List, String and binary data
> > > > > > > > > types.
> > > > > > > > >
> > > > > > > > > Philipp started an implementation for the large list
> > > > > > > > > type [3], and I hacked together a potentially viable
> > > > > > > > > Java implementation [4].
> > > > > > > > >
> > > > > > > > > I'd like to kick off the discussion for getting these
> > > > > > > > > types voted on. I'm coupling them together because I
> > > > > > > > > think there are design considerations for how we evolve
> > > > > > > > > Schema.fbs.
> > > > > > > > >
> > > > > > > > > There are two proposed options:
> > > > > > > > >
> > > > > > > > > 1. The current PR proposal, which adds a new type LargeList:
> > > > > > > > >
> > > > > > > > >    // List with 64-bit offsets
> > > > > > > > >    table LargeList {}
> > > > > > > > >
> > > > > > > > > 2. As François suggested, it might be cleaner to
> > > > > > > > > parameterize List with an offset width. I suppose
> > > > > > > > > something like:
> > > > > > > > >
> > > > > > > > >    table List {
> > > > > > > > >      // only 32-bit and 64-bit are supported.
> > > > > > > > >      bitWidth: int = 32;
> > > > > > > > >    }
> > > > > > > > >
> > > > > > > > > I think Option 2 is cleaner and potentially better
> > > > > > > > > long-term, but I think it breaks forward compatibility
> > > > > > > > > of the existing Arrow libraries.
> > > > > > > > > If we proceed with Option 2, I would advocate making the
> > > > > > > > > change to Schema.fbs all at once for all types (assuming
> > > > > > > > > we think that 64-bit offsets are desirable for all
> > > > > > > > > types), along with future-compatibility checks, to avoid
> > > > > > > > > multiple releases where future compatibility is broken
> > > > > > > > > (by broken I mean the inability to detect that an
> > > > > > > > > implementation is receiving data it can't read). What
> > > > > > > > > are people's thoughts on this?
> > > > > > > > >
> > > > > > > > > Also, any other concerns with adding these types?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Micah
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-4810
> > > > > > > > > [2] https://issues.apache.org/jira/browse/ARROW-750
> > > > > > > > > [3] https://github.com/apache/arrow/pull/3848
> > > > > > > > > [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
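[Editor's note: for readers following the proposal, the "large" types under discussion change only the width of the offsets buffer; the values buffer is identical. A minimal pure-Python sketch of the two layouts; `build_offsets` is a made-up helper for illustration, not an Arrow API:]

```python
import struct

def build_offsets(values, offset_fmt):
    """Build an Arrow-style offsets buffer for variable-width values.

    offset_fmt is a struct format code: 'i' for 32-bit offsets (as in
    Binary/List today) or 'q' for 64-bit offsets (as in the proposed
    LargeBinary/LargeList).
    """
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))  # each entry marks where a value ends
    return struct.pack('<%d%s' % (len(offsets), offset_fmt), *offsets)

values = [b"ab", b"", b"cdef"]
data = b"".join(values)             # the values buffer is the same either way
small = build_offsets(values, 'i')  # 4 offsets * 4 bytes = 16 bytes
large = build_offsets(values, 'q')  # 4 offsets * 8 bytes = 32 bytes
print(len(small), len(large))       # 16 32
```

Both buffers decode to the same logical offsets (0, 2, 2, 6); only the 64-bit variant can index past 2 GiB of value data, which is the entire motivation for the new types.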