I think this is unavoidable. Even with batches, take the example of a binary column where the mean payload size is 1 MB: the signed 32-bit offsets cap a single batch at roughly 2048 elements. This can become annoying pretty quickly.

François
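[For concreteness, a quick sketch of that arithmetic, assuming a 1 MiB mean payload and Arrow's signed 32-bit binary offsets, which cap a single array's data buffer at 2^31 - 1 bytes:]

    #include <cstdint>
    #include <iostream>

    int main() {
      // Arrow's Binary/String/List offsets are signed 32-bit integers, so
      // one array's value buffer tops out at 2^31 - 1 bytes (~2 GiB).
      const int64_t offset_limit = INT32_MAX;  // 2147483647 bytes
      const int64_t mean_payload = 1 << 20;    // assumed 1 MiB per value
      // Integer division prints 2047, i.e. roughly the 2048-element cap
      // mentioned above.
      std::cout << offset_limit / mean_payload << "\n";
      return 0;
    }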
On Fri, Apr 12, 2019 at 11:15 PM Wes McKinney <wesmck...@gmail.com> wrote:

> Hi Jacques,
>
> I think there are different use cases. What we are seeing now in many
> places is the desire to use the Arrow format to represent very large
> on-disk datasets (e.g. memory-mapped). If we don't support this use case
> it will continue to cause tension in adoption and result in some
> applications choosing not to use Arrow for things, with subsequent
> ecosystem fragmentation. The idea of breaking work into smaller batches
> seems like an analytics / database-centric viewpoint, and the project
> has already grown significantly beyond that (e.g. considering uses for
> Plasma in HPC / machine learning applications).
>
> There is a stronger argument IMHO for LargeBinary compared with
> LargeList, because opaque objects may routinely exceed 2 GB. I have just
> developed an ExtensionType facility in C++ (which I hope to see
> implemented soon in Java and elsewhere) to make this more useful /
> natural at an API level (where objects can be automatically unboxed
> into user types).
>
> Personally I would rather address the 64-bit offset issue now so that I
> stop hearing the objection from would-be users (I can count a dozen or
> so occasions where I've been accosted in person over this issue at
> conferences and elsewhere). It would be a good idea to recommend a
> preference for 32-bit offsets in our documentation.
>
> Wes
>
> On Fri, Apr 12, 2019, 10:35 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> > Definitely prefer option 1.
> >
> > I'm a -0.5 on the change in general. I think that early on users may
> > want to pattern things this way, but as you start trying to
> > parallelize work, pipeline work, etc., moving beyond moderate batch
> > sizes is ultimately a different use case and won't be supported well
> > within code that is expecting to work with smaller data structures. A
> > good example might be doing a join between two datasets that will not
> > fit in memory. Trying to solve that with individual cells and records
> > of the size proposed here is probably an extreme edge case and thus
> > won't be handled in the algorithms people will implement. So you'll
> > effectively get into this situation of second-class datatypes that
> > really aren't supported by most things.
> >
> > On Thu, Apr 11, 2019 at 2:06 PM Philipp Moritz <pcmor...@gmail.com>
> > wrote:
> >
> > > Thanks for getting the discussion started, Micah!
> > >
> > > I'm +1 on this change and also slightly prefer 1. As Antoine
> > > mentions, there doesn't seem to be a clear benefit from 2, unless
> > > we want to also support 8- or 16-bit indices in the future, which
> > > seems unlikely. So going with 1 is OK, I think.
> > >
> > > Best,
> > > Philipp.
> > >
> > > On Thu, Apr 11, 2019 at 7:06 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > > On 11/04/2019 10:52, Micah Kornfield wrote:
> > > > > ARROW-4810 [1] and ARROW-750 [2] discuss adding types with
> > > > > 64-bit offsets to Lists, Strings and binary data types.
> > > > >
> > > > > Philipp started an implementation for the large list type [3]
> > > > > and I hacked together a potentially viable Java implementation
> > > > > [4].
> > > > >
> > > > > I'd like to kick off the discussion for getting these types
> > > > > voted on. I'm coupling them together because I think there are
> > > > > design considerations for how we evolve Schema.fbs.
> > > > >
> > > > > There are two proposed options:
> > > > >
> > > > > 1. The current PR proposal, which adds a new type LargeList:
> > > > >
> > > > >     // List with 64-bit offsets
> > > > >     table LargeList {}
> > > > >
> > > > > 2. As François suggested, it might be cleaner to parameterize
> > > > > List with the offset width. I suppose something like:
> > > > >
> > > > >     table List {
> > > > >       // only 32 bit and 64 bit is supported.
> > > > >       bitWidth: int = 32;
> > > > >     }
> > > > >
> > > > > I think Option 2 is cleaner and potentially better long-term,
> > > > > but I think it breaks forward compatibility of the existing
> > > > > Arrow libraries. If we proceed with Option 2, I would advocate
> > > > > making the change to Schema.fbs all at once for all types
> > > > > (assuming we think that 64-bit offsets are desirable for all
> > > > > types), along with future-compatibility checks, to avoid
> > > > > multiple releases where future compatibility is broken (by
> > > > > broken I mean the inability to detect that an implementation
> > > > > is receiving data it can't read). What are people's thoughts
> > > > > on this?
> > > >
> > > > I think Option 1 is OK. Making List / String / Binary
> > > > parameterizable doesn't bring anything *concretely*, since the
> > > > types will not be physically interchangeable. The cost of
> > > > breaking compatibility should be offset by a compelling benefit,
> > > > which doesn't seem to exist here.
> > > >
> > > > Of course, implementations are free to refactor their internals
> > > > to avoid code duplication (for example, the C++ ListBuilder and
> > > > LargeListBuilder classes could be instances of a
> > > > BaseListBuilder<IndexType> generic type)...
> > > >
> > > > Regards
> > > >
> > > > Antoine.
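[To make Antoine's closing suggestion concrete, a minimal standalone sketch of what a shared builder template could look like — the names and interface here are hypothetical, not Arrow's actual builder API:]

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of one list-builder implementation shared across
    // offset widths; Arrow's real builder classes have a richer interface.
    template <typename OffsetType>
    class BaseListBuilder {
     public:
      BaseListBuilder() : offsets_{0} {}

      // Close the current list slot after `length` child values have been
      // appended to the child builder.
      void Append(OffsetType length) {
        offsets_.push_back(static_cast<OffsetType>(offsets_.back() + length));
      }

      const std::vector<OffsetType>& offsets() const { return offsets_; }

     private:
      std::vector<OffsetType> offsets_;  // standard Arrow offsets layout
    };

    using ListBuilder = BaseListBuilder<int32_t>;       // 32-bit offsets
    using LargeListBuilder = BaseListBuilder<int64_t>;  // 64-bit offsets

[The same parameterization would presumably let BinaryBuilder / LargeBinaryBuilder share an implementation, without the wire-format types themselves being interchangeable.]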