+1 to a vote (I can kick it off today). I had planned to respond earlier this week, but jet lag and e-mail backlog got the better of me.
On Thu, Apr 25, 2019 at 3:09 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Voting sounds like a good idea to me.
>
> Regards
>
> Antoine.
>
> On 25/04/2019 at 06:16, Micah Kornfield wrote:
> > Given that conversation seems to have died down on this, would it make
> > sense to hold a vote on allowing the large variable-width types to be
> > added? As discussed previously, PRs would need both C++ and Java
> > implementations before being merged.
> >
> > Could a PMC member facilitate this?
> >
> > Philipp, if approved, do you have bandwidth to finish up the PR for
> > LargeList?
> >
> > Thanks,
> > Micah
> >
> > On Mon, Apr 15, 2019 at 11:16 PM Philipp Moritz <pcmor...@gmail.com> wrote:
> >
> > > @Micah: I wanted to make it possible to support serializing large
> > > objects (existing large pandas dataframes with an "object" column, and
> > > also large Python types with the pyarrow serialization).
> > >
> > > On Mon, Apr 15, 2019 at 8:22 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > > To summarize my understanding of the thread so far, there seems to be
> > > > consensus on having a new distinct type for each "large" type.
> > > >
> > > > There are some reservations about the "large" types being harder to
> > > > support in algorithmic implementations.
> > > >
> > > > I'm curious, Philipp: was there a concrete use case that inspired you
> > > > to start the PR?
> > > >
> > > > Also, this was brought up on another thread, but the utility of the
> > > > "large" types might be limited in some languages (e.g. Java) until
> > > > they support buffer sizes larger than INT_MAX bytes. I brought this
> > > > up on the current PR to decouple Netty and memory management from
> > > > ArrowBuf [1], but the consensus seems to be to handle any
> > > > modifications in follow-up PRs (if they are agreed upon).
> > > >
> > > > Anything else people want to discuss before a vote on whether to
> > > > allow the additional types into the spec?
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/arrow/pull/4151
> > > >
> > > > On Monday, April 15, 2019, Jacques Nadeau <jacq...@apache.org> wrote:
> > > >
> > > > > > I am not Jacques, but I will try to give my own point of view on this.
> > > > >
> > > > > Thanks for making me laugh :)
> > > > >
> > > > > > I think that this is unavoidable. Even with batches, taking an
> > > > > > example of a binary column where the mean size of the payload is
> > > > > > 1 MB, it limits us to batches of 2048 elements. This can become
> > > > > > annoying pretty quickly.
> > > > >
> > > > > Good example. I'm not sure columnar matters, but I find it more
> > > > > useful than others.
> > > > >
> > > > > > logical types and physical types
> > > > >
> > > > > TL;DR: it is painful no matter which model you pick.
> > > > >
> > > > > We definitely worked hard to go a different way with Arrow than
> > > > > with Parquet. It was something I pushed consciously when we
> > > > > started, as I found some of the patterns in Parquet quite
> > > > > challenging. Unfortunately, we went too far in some places in the
> > > > > Java code, which tried to parallel the structure of the physical
> > > > > types directly (and thus the big refactor we did to reduce
> > > > > duplication last year -- props to Sidd, Bryan and the others who
> > > > > worked on that). I also think we probably lost as much as we
> > > > > gained with the current model.
> > > > >
> > > > > I agree with Antoine, both in his clean statement of the
> > > > > approaches and that sticking to the model we have today makes the
> > > > > most sense.
> > > > >
> > > > > On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for the clarification Antoine, very insightful.
> > > > > >
> > > > > > I'd also vote for keeping the existing model, for consistency.
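[Editor's note: the 2048-element figure in the batch-size example above follows directly from the 32-bit offset limit. A quick back-of-the-envelope check in plain Python; the 1 MB mean payload is taken from the message, not from the Arrow spec:]

```python
# With 32-bit signed offsets, a single variable-width (binary) array can
# address at most 2**31 - 1 bytes of value data.
INT32_OFFSET_LIMIT = 2**31 - 1

mean_payload = 2**20  # ~1 MB per binary value, as in the example above
max_elements = INT32_OFFSET_LIMIT // mean_payload

print(max_elements)  # 2047 -- i.e. roughly the 2048-element batches quoted
```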
> > > > > > On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou <anto...@python.org> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am not Jacques, but I will try to give my own point of view on this.
> > > > > > >
> > > > > > > The distinction between logical and physical types can be
> > > > > > > modelled in two different ways:
> > > > > > >
> > > > > > > 1) A physical type can denote several logical types, but a
> > > > > > > logical type can only have a single physical representation.
> > > > > > > This is currently the Arrow model.
> > > > > > >
> > > > > > > 2) A physical type can denote several logical types, and a
> > > > > > > logical type can also be denoted by several physical types.
> > > > > > > This is the Parquet model.
> > > > > > >
> > > > > > > (Theoretically, there are two other possible models, but they
> > > > > > > are not very interesting to consider, since they don't seem to
> > > > > > > cater to concrete use cases.)
> > > > > > >
> > > > > > > Model 1 is obviously more restrictive, while model 2 is more
> > > > > > > flexible. Model 2 could be called "higher level"; you see
> > > > > > > something similar if you compare Python's and C++'s typing
> > > > > > > systems. On the other hand, model 1 provides a potentially
> > > > > > > simpler programming model for implementors of low-level
> > > > > > > kernels, as you can simply query the logical type of your data
> > > > > > > and you automatically know its physical type.
> > > > > > >
> > > > > > > The model chosen for Arrow is ingrained in its API. If we want
> > > > > > > to change the model, we'd better do it wholesale (implying
> > > > > > > probably a large refactoring and a significant number of
> > > > > > > unavoidable regressions) to avoid subjecting users to a
> > > > > > > confusing middle point.
> > > > > > >
> > > > > > > Also, as a side note, "convertibility" between different types
> > > > > > > can be a hairy subject... Having strict boundaries between
> > > > > > > types avoids being dragged into it too early.
> > > > > > > To return to the original subject: IMHO, LargeList (resp.
> > > > > > > LargeBinary) should be a distinct logical type from List
> > > > > > > (resp. Binary), the same way Int64 is a distinct logical type
> > > > > > > from Int32.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > > On 15/04/2019 at 18:45, Francois Saint-Jacques wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I would like to understand where we stand on logical types
> > > > > > > > and physical types. As I understand it, this proposal is
> > > > > > > > for the physical representation.
> > > > > > > >
> > > > > > > > In the context of an execution engine, the concept of
> > > > > > > > logical types becomes more important, as two physical
> > > > > > > > representations might have the same semantic values, e.g.
> > > > > > > > LargeList and List where all values fit in 32 bits. A more
> > > > > > > > complex example would be an Integer array and a dictionary
> > > > > > > > array whose values are integers.
> > > > > > > >
> > > > > > > > Is this only relevant for an execution engine? What about
> > > > > > > > the (C++) Array.Equals method and related comparison
> > > > > > > > methods? This also touches on the subject of type equality,
> > > > > > > > e.g. dictionaries with different but compatible encodings.
> > > > > > > >
> > > > > > > > Jacques, knowing that you worked on Parquet (which follows
> > > > > > > > this model) and Dremio, what is your opinion?
> > > > > > > > François
> > > > > > > >
> > > > > > > > Some related tickets:
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-554
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-1741
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-3144
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-4097
> > > > > > > > - https://jira.apache.org/jira/browse/ARROW-5052
> > > > > > > >
> > > > > > > > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > ARROW-4810 [1] and ARROW-750 [2] discuss adding types
> > > > > > > > > with 64-bit offsets to the List, String and binary data
> > > > > > > > > types.
> > > > > > > > >
> > > > > > > > > Philipp started an implementation for the large list
> > > > > > > > > type [3], and I hacked together a potentially viable
> > > > > > > > > Java implementation [4].
> > > > > > > > >
> > > > > > > > > I'd like to kick off the discussion for getting these
> > > > > > > > > types voted on. I'm coupling them together because I
> > > > > > > > > think there are design considerations for how we evolve
> > > > > > > > > Schema.fbs.
> > > > > > > > >
> > > > > > > > > There are two proposed options:
> > > > > > > > >
> > > > > > > > > 1. The current PR proposal, which adds a new type LargeList:
> > > > > > > > >
> > > > > > > > >    // List with 64-bit offsets
> > > > > > > > >    table LargeList {}
> > > > > > > > >
> > > > > > > > > 2. As François suggested, it might be cleaner to
> > > > > > > > > parameterize List with an offset width. I suppose
> > > > > > > > > something like:
> > > > > > > > >
> > > > > > > > >    table List {
> > > > > > > > >      // only 32-bit and 64-bit are supported.
> > > > > > > > >      bitWidth: int = 32;
> > > > > > > > >    }
> > > > > > > > >
> > > > > > > > > I think Option 2 is cleaner and potentially better
> > > > > > > > > long-term, but I think it breaks forward compatibility
> > > > > > > > > of the existing Arrow libraries.
> > > > > > > > > If we proceed with Option 2, I would advocate making the
> > > > > > > > > change to Schema.fbs all at once for all types (assuming
> > > > > > > > > we think that 64-bit offsets are desirable for all
> > > > > > > > > types), along with future-compatibility checks, to avoid
> > > > > > > > > multiple releases where future compatibility is broken
> > > > > > > > > (by broken I mean the inability to detect that an
> > > > > > > > > implementation is receiving data it can't read). What
> > > > > > > > > are people's thoughts on this?
> > > > > > > > >
> > > > > > > > > Also, any other concerns with adding these types?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Micah
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-4810
> > > > > > > > > [2] https://issues.apache.org/jira/browse/ARROW-750
> > > > > > > > > [3] https://github.com/apache/arrow/pull/3848
> > > > > > > > > [4] https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
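[Editor's note: for readers following the proposal, the "large" types under discussion change only the width of the offsets buffer; the values buffer is identical. A minimal pure-Python sketch of the two layouts; `build_offsets` is a made-up helper for illustration, not an Arrow API:]

```python
import struct

def build_offsets(values, offset_fmt):
    """Build an Arrow-style offsets buffer for variable-width values.

    offset_fmt is a struct format code: 'i' for 32-bit offsets (as in
    Binary/List today) or 'q' for 64-bit offsets (as in the proposed
    LargeBinary/LargeList).
    """
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))  # each entry marks where a value ends
    return struct.pack('<%d%s' % (len(offsets), offset_fmt), *offsets)

values = [b"ab", b"", b"cdef"]
data = b"".join(values)             # the values buffer is the same either way
small = build_offsets(values, 'i')  # 4 offsets * 4 bytes = 16 bytes
large = build_offsets(values, 'q')  # 4 offsets * 8 bytes = 32 bytes
print(len(small), len(large))       # 16 32
```

Both buffers decode to the same logical offsets (0, 2, 2, 6); only the 64-bit variant can index past 2 GiB of value data, which is the entire motivation for the new types.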