Given that conversation on this seems to have died down, would it make sense to hold a vote on allowing the large variable-width types to be added? As discussed previously, PRs would need both C++ and Java implementations before being merged.
Could a PMC member facilitate this? Philipp, if approved, do you have bandwidth to finish up the PR for LargeList?

Thanks,
Micah

On Mon, Apr 15, 2019 at 11:16 PM Philipp Moritz <[email protected]> wrote:

> @Micah: I wanted to make it possible to serialize large objects (existing
> large pandas dataframes with an "object" column, and also large Python
> types with the pyarrow serialization).
>
> On Mon, Apr 15, 2019 at 8:22 PM Micah Kornfield <[email protected]> wrote:
>
>> To summarize my understanding of the thread so far, there seems to be
>> consensus on having a new distinct type for each "large" type.
>>
>> There are some reservations around the "large" types being harder to
>> support in algorithmic implementations.
>>
>> I'm curious, Philipp: was there a concrete use case that inspired you to
>> start the PR?
>>
>> Also, this was brought up on another thread, but the utility of the
>> "large" types might be limited in some languages (e.g. Java) until they
>> support buffer sizes larger than INT_MAX bytes. I brought this up on the
>> current PR to decouple Netty and memory management from ArrowBuf [1], but
>> the consensus seems to be to handle any modifications in follow-up PRs (if
>> they are agreed upon).
>>
>> Anything else people want to discuss before a vote on whether to allow
>> the additional types into the spec?
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/arrow/pull/4151
>>
>> On Monday, April 15, 2019, Jacques Nadeau <[email protected]> wrote:
>>
>>> > I am not Jacques, but I will try to give my own point of view on this.
>>>
>>> Thanks for making me laugh :)
>>>
>>> > I think that this is unavoidable. Even with batches, taking an example
>>> > of a binary column where the mean size of the payload is 1 MB, it
>>> > limits you to batches of 2048 elements. This can become annoying
>>> > pretty quickly.
>>>
>>> Good example. I'm not sure columnar matters but I find it more useful
>>> than others.
>>>
>>> > logical types and physical types
>>>
>>> TL;DR: it is painful no matter which model you pick.
>>>
>>> I definitely think we worked hard to go a different way with Arrow than
>>> with Parquet. It was something I pushed consciously when we started, as
>>> I found some of the patterns in Parquet quite challenging. Unfortunately,
>>> we went too far in some places in the Java code, which tried to parallel
>>> the structure of the physical types directly (hence the big refactor we
>>> did to reduce duplication last year -- props to Sidd, Bryan and the
>>> others who worked on that). I also think we probably lost as much as we
>>> gained with the current model.
>>>
>>> I agree with Antoine, both in his clean statement of the approaches and
>>> that sticking to the model we have today makes the most sense.
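To make the arithmetic in the quoted batch-size example concrete: with
signed 32-bit offsets, a single variable-width column's data buffer tops
out at INT32_MAX bytes. A minimal standalone C++ sketch (illustrative
constants, not Arrow API):

    #include <cstdint>
    #include <iostream>

    int main() {
      const int64_t kMaxDataBytes = INT32_MAX;   // 2,147,483,647 (~2 GiB)
      const int64_t kMeanPayloadBytes = 1 << 20; // 1 MiB per value
      // Integer division: 2047 full 1 MiB values fit per buffer, i.e.
      // roughly the "batches of 2048 elements" ceiling mentioned above.
      std::cout << kMaxDataBytes / kMeanPayloadBytes << " values\n";
      return 0;
    }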
>>>
>>> On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <
>>> [email protected]> wrote:
>>>
>>> > Thanks for the clarification Antoine, very insightful.
>>> >
>>> > I'd also vote for keeping the existing model, for consistency.
>>> >
>>> > On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <[email protected]>
>>> > wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I am not Jacques, but I will try to give my own point of view on this.
>>> > >
>>> > > The distinction between logical and physical types can be modelled
>>> > > in two different ways:
>>> > >
>>> > > 1) A physical type can denote several logical types, but a logical
>>> > > type can only have a single physical representation. This is
>>> > > currently the Arrow model.
>>> > >
>>> > > 2) A physical type can denote several logical types, and a logical
>>> > > type can also be denoted by several physical types. This is the
>>> > > Parquet model.
>>> > >
>>> > > (Theoretically, there are two other possible models, but they are
>>> > > not very interesting to consider, since they don't seem to cater to
>>> > > concrete use cases.)
>>> > >
>>> > > Model 1 is obviously more restrictive, while model 2 is more
>>> > > flexible. Model 2 could be called "higher level"; you see something
>>> > > similar if you compare Python's and C++'s typing systems. On the
>>> > > other hand, model 1 provides a potentially simpler programming model
>>> > > for implementors of low-level kernels, as you can simply query the
>>> > > logical type of your data and you automatically know its physical
>>> > > type.
>>> > >
>>> > > The model chosen for Arrow is ingrained in its API. If we want to
>>> > > change the model, we'd better do it wholesale (implying probably a
>>> > > large refactoring and a significant number of unavoidable
>>> > > regressions) to avoid subjecting users to a confusing middle point.
>>> > >
>>> > > Also, as a side note, "convertibility" between different types can
>>> > > be a hairy subject... Having strict boundaries between types avoids
>>> > > being dragged into it too early.
>>> > >
>>> > > To return to the original subject: IMHO, LargeList (resp.
>>> > > LargeBinary) should be a distinct logical type from List (resp.
>>> > > Binary), the same way Int64 is a distinct logical type from Int32.
>>> > >
>>> > > Regards,
>>> > >
>>> > > Antoine.
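Antoine's point that under model 1 you "simply query the logical type of
your data and you automatically know its physical type" can be sketched in
a few lines of C++. The enum and helper below are hypothetical, not actual
Arrow code; they only illustrate that the mapping from logical type to
physical layout is a fixed, total function a kernel can dispatch on:

    // Hypothetical logical type ids (not Arrow's real enum).
    enum class TypeId { Int32, Int64, List, LargeList };

    // Offset width (in bits) implied by a list-like logical type.
    // Under model 1 this needs no extra metadata: the id is enough.
    inline int OffsetBitWidth(TypeId id) {
      switch (id) {
        case TypeId::List:      return 32;  // List always has 32-bit offsets
        case TypeId::LargeList: return 64;  // LargeList always has 64-bit offsets
        default:                return 0;   // not an offset-based type
      }
    }

Under model 2, by contrast, the same helper would need the physical type as
a separate input alongside the logical type.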
>>> > >
>>> > > On 15/04/2019 at 18:45, Francois Saint-Jacques wrote:
>>> > > > Hello,
>>> > > >
>>> > > > I would like to understand where we stand on logical types and
>>> > > > physical types. As I understand it, this proposal is for the
>>> > > > physical representation.
>>> > > >
>>> > > > In the context of an execution engine, the concept of logical
>>> > > > types becomes more important, as two physical representations
>>> > > > might have the same semantic values, e.g. LargeList and List where
>>> > > > all values fit in 32 bits. A more complex example would be an
>>> > > > integer array and a dictionary array whose values are integers.
>>> > > >
>>> > > > Is this only relevant for execution engines? What about the (C++)
>>> > > > Array.Equals method and related comparison methods? This also
>>> > > > touches on the subject of type equality, e.g. dictionaries with
>>> > > > different but compatible encodings.
>>> > > >
>>> > > > Jacques, knowing that you worked on Parquet (which follows this
>>> > > > model) and Dremio, what is your opinion?
>>> > > >
>>> > > > François
>>> > > >
>>> > > > Some related tickets:
>>> > > > - https://jira.apache.org/jira/browse/ARROW-554
>>> > > > - https://jira.apache.org/jira/browse/ARROW-1741
>>> > > > - https://jira.apache.org/jira/browse/ARROW-3144
>>> > > > - https://jira.apache.org/jira/browse/ARROW-4097
>>> > > > - https://jira.apache.org/jira/browse/ARROW-5052
>>> > > >
>>> > > > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield
>>> > > > <[email protected]> wrote:
>>> > > >
>>> > > >> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
>>> > > >> offsets to Lists, Strings and binary data types.
>>> > > >>
>>> > > >> Philipp started an implementation for the large list type [3],
>>> > > >> and I hacked together a potentially viable Java implementation [4].
>>> > > >>
>>> > > >> I'd like to kick off the discussion for getting these types voted
>>> > > >> on. I'm coupling them together because I think there are design
>>> > > >> considerations for how we evolve Schema.fbs.
>>> > > >>
>>> > > >> There are two proposed options:
>>> > > >>
>>> > > >> 1. The current PR proposal, which adds a new type LargeList:
>>> > > >>
>>> > > >>     // List with 64-bit offsets
>>> > > >>     table LargeList {}
>>> > > >>
>>> > > >> 2. As François suggested, it might be cleaner to parameterize
>>> > > >> List with the offset width. I suppose something like:
>>> > > >>
>>> > > >>     table List {
>>> > > >>       // only 32-bit and 64-bit are supported.
>>> > > >>       bitWidth: int = 32;
>>> > > >>     }
>>> > > >>
>>> > > >> I think Option 2 is cleaner and potentially better long-term, but
>>> > > >> I think it breaks forward compatibility for the existing Arrow
>>> > > >> libraries. If we proceed with Option 2, I would advocate making
>>> > > >> the change to Schema.fbs all at once for all types (assuming we
>>> > > >> think that 64-bit offsets are desirable for all types), along
>>> > > >> with forward-compatibility checks, to avoid multiple releases
>>> > > >> where forward compatibility is broken (by broken I mean the
>>> > > >> inability to detect that an implementation is receiving data it
>>> > > >> can't read). What are people's thoughts on this?
>>> > > >>
>>> > > >> Also, any other concerns with adding these types?
>>> > > >>
>>> > > >> Thanks,
>>> > > >> Micah
>>> > > >>
>>> > > >> [1] https://issues.apache.org/jira/browse/ARROW-4810
>>> > > >> [2] https://issues.apache.org/jira/browse/ARROW-750
>>> > > >> [3] https://github.com/apache/arrow/pull/3848
>>> > > >> [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
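For a concrete picture of what Option 1 implies physically, here is a
hedged C++ sketch with illustrative structs (not Arrow's actual classes):
the only layout difference between each existing/"large" pair is the width
of the offsets into the shared data buffer.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct BinaryColumn {            // today's Binary (likewise String, List)
      std::vector<int32_t> offsets;  // length + 1 entries; caps the data
      std::string data;              // buffer at 2^31 - 1 bytes
    };

    struct LargeBinaryColumn {       // proposed LargeBinary, 64-bit offsets
      std::vector<int64_t> offsets;  // same layout, wider index type
      std::string data;
    };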
