Given that conversation on this seems to have died down, would it make sense to hold a vote on allowing the large variable-width types to be added? As discussed previously, PRs would need both C++ and Java implementations before being merged.
Could a PMC member facilitate this? Philipp, if approved, do you have bandwidth to finish up the PR for LargeList?

Thanks,
Micah

On Mon, Apr 15, 2019 at 11:16 PM Philipp Moritz <[email protected]> wrote:

> @Micah: I wanted to make it possible to serialize large objects (existing
> large pandas dataframes with an "object" column, and also large Python
> types with the pyarrow serialization).
>
> On Mon, Apr 15, 2019 at 8:22 PM Micah Kornfield <[email protected]> wrote:
>
>> To summarize my understanding of the thread so far, there seems to be
>> consensus on having a new distinct type for each "large" type.
>>
>> There are some reservations around the "large" types being harder to
>> support in algorithmic implementations.
>>
>> I'm curious, Philipp: was there a concrete use case that inspired you to
>> start the PR?
>>
>> Also, this was brought up on another thread, but the utility of the
>> "large" types might be limited in some languages (e.g. Java) until they
>> support buffer sizes larger than INT_MAX bytes. I brought this up on the
>> current PR to decouple Netty and memory management from ArrowBuf [1], but
>> the consensus seems to be to handle any modifications in follow-up PRs (if
>> they are agreed upon).
>>
>> Anything else people want to discuss before a vote on whether to allow
>> the additional types into the spec?
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/arrow/pull/4151
>>
>> On Monday, April 15, 2019, Jacques Nadeau <[email protected]> wrote:
>>
>>> > I am not Jacques, but I will try to give my own point of view on this.
>>>
>>> Thanks for making me laugh :)
>>>
>>> > I think that this is unavoidable. Even with batches, taking an example
>>> > of a binary column where the mean size of the payload is 1 MB, it
>>> > limits you to batches of 2048 elements. This can become annoying
>>> > pretty quickly.
>>>
>>> Good example. I'm not sure columnar matters but I find it more useful
>>> than others.
>>>
>>> > logical types and physical types
>>>
>>> TL;DR: it is painful no matter which model you pick.
>>>
>>> I definitely think we worked hard to go a different way with Arrow than
>>> with Parquet. It was something I pushed consciously when we started, as
>>> I found some of the patterns in Parquet quite challenging. Unfortunately,
>>> we went too far in some places in the Java code, which tried to parallel
>>> the structure of the physical types directly (hence the big refactor we
>>> did to reduce duplication last year -- props to Sidd, Bryan and the
>>> others who worked on that). I also think we probably lost as much as we
>>> gained with the current model.
>>>
>>> I agree with Antoine, both in his clean statement of the approaches and
>>> that sticking to the model we have today makes the most sense.
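To make the arithmetic in the quoted batch-size example concrete: with
signed 32-bit offsets, a single variable-width column's data buffer tops
out at INT32_MAX bytes. A minimal standalone C++ sketch (illustrative
constants, not Arrow API):

    #include <cstdint>
    #include <iostream>

    int main() {
      const int64_t kMaxDataBytes = INT32_MAX;   // 2,147,483,647 (~2 GiB)
      const int64_t kMeanPayloadBytes = 1 << 20; // 1 MiB per value
      // Integer division: 2047 full 1 MiB values fit per buffer, i.e.
      // roughly the "batches of 2048 elements" ceiling mentioned above.
      std::cout << kMaxDataBytes / kMeanPayloadBytes << " values\n";
      return 0;
    }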
>>>
>>> On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <
>>> [email protected]> wrote:
>>>
>>> > Thanks for the clarification Antoine, very insightful.
>>> >
>>> > I'd also vote for keeping the existing model, for consistency.
>>> >
>>> > On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <[email protected]>
>>> > wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I am not Jacques, but I will try to give my own point of view on this.
>>> > >
>>> > > The distinction between logical and physical types can be modelled
>>> > > in two different ways:
>>> > >
>>> > > 1) A physical type can denote several logical types, but a logical
>>> > > type can only have a single physical representation. This is
>>> > > currently the Arrow model.
>>> > >
>>> > > 2) A physical type can denote several logical types, and a logical
>>> > > type can also be denoted by several physical types. This is the
>>> > > Parquet model.
>>> > >
>>> > > (Theoretically, there are two other possible models, but they are
>>> > > not very interesting to consider, since they don't seem to cater to
>>> > > concrete use cases.)
>>> > >
>>> > > Model 1 is obviously more restrictive, while model 2 is more
>>> > > flexible. Model 2 could be called "higher level"; you see something
>>> > > similar if you compare Python's and C++'s typing systems. On the
>>> > > other hand, model 1 provides a potentially simpler programming model
>>> > > for implementors of low-level kernels, as you can simply query the
>>> > > logical type of your data and you automatically know its physical
>>> > > type.
>>> > >
>>> > > The model chosen for Arrow is ingrained in its API. If we want to
>>> > > change the model, we'd better do it wholesale (implying probably a
>>> > > large refactoring and a significant number of unavoidable
>>> > > regressions) to avoid subjecting users to a confusing middle point.
>>> > >
>>> > > Also, as a side note, "convertibility" between different types can
>>> > > be a hairy subject... Having strict boundaries between types avoids
>>> > > being dragged into it too early.
>>> > >
>>> > > To return to the original subject: IMHO, LargeList (resp.
>>> > > LargeBinary) should be a distinct logical type from List (resp.
>>> > > Binary), the same way Int64 is a distinct logical type from Int32.
>>> > >
>>> > > Regards,
>>> > >
>>> > > Antoine.
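Antoine's point that under model 1 you "simply query the logical type of
your data and you automatically know its physical type" can be sketched in
a few lines of C++. The enum and helper below are hypothetical, not actual
Arrow code; they only illustrate that the mapping from logical type to
physical layout is a fixed, total function a kernel can dispatch on:

    // Hypothetical logical type ids (not Arrow's real enum).
    enum class TypeId { Int32, Int64, List, LargeList };

    // Offset width (in bits) implied by a list-like logical type.
    // Under model 1 this needs no extra metadata: the id is enough.
    inline int OffsetBitWidth(TypeId id) {
      switch (id) {
        case TypeId::List:      return 32;  // List always has 32-bit offsets
        case TypeId::LargeList: return 64;  // LargeList always has 64-bit offsets
        default:                return 0;   // not an offset-based type
      }
    }

Under model 2, by contrast, the same helper would need the physical type as
a separate input alongside the logical type.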
>>> > >
>>> > > On 15/04/2019 at 18:45, Francois Saint-Jacques wrote:
>>> > > > Hello,
>>> > > >
>>> > > > I would like to understand where we stand on logical types and
>>> > > > physical types. As I understand it, this proposal is for the
>>> > > > physical representation.
>>> > > >
>>> > > > In the context of an execution engine, the concept of logical
>>> > > > types becomes more important, as two physical representations
>>> > > > might have the same semantic values, e.g. LargeList and List where
>>> > > > all values fit in 32 bits. A more complex example would be an
>>> > > > integer array and a dictionary array whose values are integers.
>>> > > >
>>> > > > Is this only relevant for execution engines? What about the (C++)
>>> > > > Array.Equals method and related comparison methods? This also
>>> > > > touches on the subject of type equality, e.g. dictionaries with
>>> > > > different but compatible encodings.
>>> > > >
>>> > > > Jacques, knowing that you worked on Parquet (which follows this
>>> > > > model) and Dremio, what is your opinion?
>>> > > >
>>> > > > François
>>> > > >
>>> > > > Some related tickets:
>>> > > > - https://jira.apache.org/jira/browse/ARROW-554
>>> > > > - https://jira.apache.org/jira/browse/ARROW-1741
>>> > > > - https://jira.apache.org/jira/browse/ARROW-3144
>>> > > > - https://jira.apache.org/jira/browse/ARROW-4097
>>> > > > - https://jira.apache.org/jira/browse/ARROW-5052
>>> > > >
>>> > > > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield
>>> > > > <[email protected]> wrote:
>>> > > >
>>> > > >> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
>>> > > >> offsets to Lists, Strings and binary data types.
>>> > > >>
>>> > > >> Philipp started an implementation for the large list type [3],
>>> > > >> and I hacked together a potentially viable Java implementation [4].
>>> > > >>
>>> > > >> I'd like to kick off the discussion for getting these types voted
>>> > > >> on. I'm coupling them together because I think there are design
>>> > > >> considerations for how we evolve Schema.fbs.
>>> > > >>
>>> > > >> There are two proposed options:
>>> > > >>
>>> > > >> 1. The current PR proposal, which adds a new type LargeList:
>>> > > >>
>>> > > >>     // List with 64-bit offsets
>>> > > >>     table LargeList {}
>>> > > >>
>>> > > >> 2. As François suggested, it might be cleaner to parameterize
>>> > > >> List with the offset width. I suppose something like:
>>> > > >>
>>> > > >>     table List {
>>> > > >>       // only 32-bit and 64-bit are supported.
>>> > > >>       bitWidth: int = 32;
>>> > > >>     }
>>> > > >>
>>> > > >> I think Option 2 is cleaner and potentially better long-term, but
>>> > > >> I think it breaks forward compatibility for the existing Arrow
>>> > > >> libraries. If we proceed with Option 2, I would advocate making
>>> > > >> the change to Schema.fbs all at once for all types (assuming we
>>> > > >> think that 64-bit offsets are desirable for all types), along
>>> > > >> with forward-compatibility checks, to avoid multiple releases
>>> > > >> where forward compatibility is broken (by broken I mean the
>>> > > >> inability to detect that an implementation is receiving data it
>>> > > >> can't read). What are people's thoughts on this?
>>> > > >>
>>> > > >> Also, any other concerns with adding these types?
>>> > > >>
>>> > > >> Thanks,
>>> > > >> Micah
>>> > > >>
>>> > > >> [1] https://issues.apache.org/jira/browse/ARROW-4810
>>> > > >> [2] https://issues.apache.org/jira/browse/ARROW-750
>>> > > >> [3] https://github.com/apache/arrow/pull/3848
>>> > > >> [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
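For a concrete picture of what Option 1 implies physically, here is a
hedged C++ sketch with illustrative structs (not Arrow's actual classes):
the only layout difference between each existing/"large" pair is the width
of the offsets into the shared data buffer.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct BinaryColumn {            // today's Binary (likewise String, List)
      std::vector<int32_t> offsets;  // length + 1 entries; caps the data
      std::string data;              // buffer at 2^31 - 1 bytes
    };

    struct LargeBinaryColumn {       // proposed LargeBinary, 64-bit offsets
      std::vector<int64_t> offsets;  // same layout, wider index type
      std::string data;
    };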
