What's the deal with unsafe_get? Does that segfault if you try to access a null value reference value and give you random junk if you try to access an bits value?
On Tue, Aug 12, 2014 at 1:09 AM, John Myles White <johnmyleswh...@gmail.com> wrote: > Ok, I’ve cleaned this up and renamed it to NullableTypes.jl: > https://github.com/johnmyleswhite/NullableTypes.jl > > — John > > On Aug 11, 2014, at 9:17 PM, Stefan Karpinski <ste...@karpinski.org> > wrote: > > Sounds good. We could maybe include the Nullable type in Base and thus > avoid the issue of what to call the module. If we need a module, how about > Nullables? Although, I have to say that name makes me think of The > Expendables. > > <expendables.png> > > > On Tue, Aug 12, 2014 at 12:10 AM, John Myles White < > johnmyleswh...@gmail.com> wrote: > >> Just wanted to come back to this thread now that I’m back from vacation. >> It sounds like the consensus of Jeff, Stefan, Keno and Jameson is that >> we’re better off working with an explicit Option{T} type than trying to get >> the compiler to handle Union(S, T) more efficiently. >> >> If people are willing to accept that idea, I’d like to make the use of >> Option’s a priority for a release of DataArrays that would accompany Julia >> 0.4. There’s going to be a lot of changes to Julia’s core with that >> release, so it seems like the perfect time to make some breaking changes to >> JuliaData packages. >> >> After thinking about the type hierarchy more, I’ve come to really like >> the interpretion of Option{T} as a 0-or-1 element container type. In that >> interpretation, Option{T} is to T exactly as Array{T} is to T, which is a >> relationship that has only horizontal links and no vertical links. I think >> that perspective simplifies things a lot, whie articulating the core issues >> involved with working with Option{T}. >> >> For me, the main question is whether we want Option types to be a feature >> that’s shared by people who want to express NULL pointers (i.e. ontological >> missingness) or whether we want to customize things for statistical >> missingness (i.e. epistemological missingness). As it stands, >> OptionTypes.jl implements such a bare-bones version of missingness that it >> could be safely used for either purpose. >> >> As for the naming debate, I think NullableTypes with a new type called >> Nullable{T} is the way to go. >> >> — John >> >> On Aug 1, 2014, at 1:43 PM, Stefan Karpinski <ste...@karpinski.org> >> wrote: >> >> That also strikes me as the best approach for what it's worth – just use >> option/maybe/nullable for what you return when indexing into a DataArray >> but keep the DataArray storage as two separate arrays. >> >> >> On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <jeff.bezan...@gmail.com> >> wrote: >> >>> As usual, I agree with Keno :) >>> >>> We could also implement optimizations for Union(Bits,OtherBits). In >>> theory this can be stack allocated along with a boolean flag that says >>> which one it is. However to take full advantage of this it seems you >>> need to generate lots of branches with code for each case. Possible >>> but tricky. >>> >>> There is also a strong connection to the general >>> array-of-structs-to-struct-of-arrays optimization. It would not be >>> totally crazy to build this in to our object representation somehow, >>> or add hooks allowing customization of data representations, like >>> staged functions but for data instead of code. >>> >>> >>> On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <vtjn...@gmail.com> wrote: >>> > I could (and probably will, someday) revive that commit. At the time, >>> > though, I seemed to find that it provided little performance benefit >>> -- the >>> > gc cost of allocating boxes was far greater (for type uncertainty >>> involving >>> > bitstypes) and the type dispatch wasn't as much of a performance >>> impact as I >>> > had previously assumed. >>> > >>> > >>> > On Friday, August 1, 2014, Keno Fischer <kfisc...@college.harvard.edu> >>> > wrote: >>> >> >>> >> It is possible to do generic compiler improvements for Union types >>> >> (Jameson had a branch at some point that did callsite splitting if we >>> >> inferred a Union type). However, I think the best way to go here is to >>> >> maintain the current separation of two arrays (one of the values one >>> >> for the NAs), but give an option type on access. The option type would >>> >> then most likely be in memory and wouldn't have much overhead. Please >>> >> let me know if there's anything specific I should explain how the >>> >> compiler will handle it, I admit I have only skimmed this thread. >>> >> >>> >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <si...@simonster.com> >>> >> wrote: >>> >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat >>> wrote: >>> >> >> >>> >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit : >>> >> >> >>> >> >> To address Simon’s general points, which are really good reasons to >>> >> >> avoid >>> >> >> jumping on the Option{T} bandwagon too soon: >>> >> >> >>> >> >> >>> >> >> >>> >> >> * I agree that most languages use tagged union types for Option{T} >>> >> >> rather >>> >> >> than a wrapper type that contains a Boolean value. It’s also >>> totally >>> >> >> true >>> >> >> that many compilers are able to make those constructs more >>> efficient >>> >> >> than >>> >> >> Julia currently does. But what we should expect from Julia in the >>> >> >> coming >>> >> >> years isn’t so clear to me. (And I personally think we need to >>> settle >>> >> >> on a >>> >> >> solution for representing missing data that’s viable in a year >>> rather >>> >> >> than >>> >> >> viable in five years.) This is an issue that I’d really like to >>> have >>> >> >> input >>> >> >> on from Jeff, Keno, Jameson or someone else involved with the >>> internals >>> >> >> of >>> >> >> the compiler. Getting input from the broader community is the main >>> >> >> reason I >>> >> >> wanted to put a demo of OptionTypes.jl out in front of other folks. >>> >> >> >>> >> >> >>> >> >> >>> >> >> * I’m not clear how we could come to know that a datum is not >>> missing >>> >> >> without a resolution step that’s effectively equivalent to the >>> get() >>> >> >> function for Option{T}. I agree that the enforced use of get() >>> means >>> >> >> that >>> >> >> you can’t hope to use generic functions like sum on collections of >>> >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think >>> the >>> >> >> easiest way to express to the compiler that you know that all of >>> the >>> >> >> entries >>> >> >> of a DataArray are not NA is to convert the DataArray to a straight >>> >> >> Array. >>> >> >> But maybe you have other mechanisms for expressing this knowledge. >>> >> >> Certainly >>> >> >> my proposal to do conversions to Arrays isn’t the most elegant >>> >> >> strategy. >>> >> >> It’s just all that I’ve got so far. >>> >> >> >>> >> >> >>> >> >> >>> >> >> * I kind of like the idea of Option{T} standing outside of the main >>> >> >> type >>> >> >> system in a kind of mirror type system. I’m less happy about >>> Union(NA, >>> >> >> T) >>> >> >> being a super type of T, even though there are some good reasons >>> that >>> >> >> you’d >>> >> >> like to view T as a specialization of Union(NA, T). But I agree >>> that I >>> >> >> don’t >>> >> >> have a good feel about where missing data belongs in the type >>> >> >> hierarchy. >>> >> >> This is another question for which I’d love to get input from >>> others. >>> >> >> >>> >> >> >>> >> >> >>> >> >> In regard to Simon’s performance points: >>> >> >> >>> >> >> >>> >> >> >>> >> >> * Yes, memory usage alone argues strongly for working with >>> DataArray{T} >>> >> >> rather than Array{Option{T}}. >>> >> >> >>> >> >> >>> >> >> >>> >> >> * Exploting tricks that make operations like anyna() faster is >>> another >>> >> >> good argument for keeping DataArray{T} around. >>> >> >> >>> >> >> >>> >> >> >>> >> >> * I’m not sure how to deal with inlining concerns or the undefined >>> >> >> reference checks. Do you have ideas for improving this within >>> >> >> DataArrays or >>> >> >> do we need supporting changes in the compiler? >>> >> >> >>> >> >> Actually it seems it would be possible to make Array{Union(NAtype, >>> T)} >>> >> >> more similar to and as efficient as DataArray{T}, by handling a few >>> >> >> things >>> >> >> in the compiler. This would create a generalization of DataArray >>> to any >>> >> >> kind >>> >> >> of union type, which could be useful in other contexts. But more >>> >> >> importantly, it would make missing values integrate seamlessly into >>> >> >> Julia, >>> >> >> getting rid of any hacks. >>> >> >> >>> >> >> More specifically, the following features would need to be >>> supported: >>> >> >> - a way of telling the compiler to store the data as two arrays of >>> >> >> concrete types (here T and NAtype), instead of as an array of boxed >>> >> >> values, >>> >> >> so that: >>> >> >> * efficient operations can be performed on the T values (by >>> >> >> skipping >>> >> >> the missing ones manually) >>> >> >> * T values are stored as a dense array and can be converted to >>> >> >> Array{T} without any copy or passed to BLAS when no missing values >>> are >>> >> >> present >>> >> >> * NA values can be packed in a BitArray to save memory and make >>> NA >>> >> >> detection faster (see below) >>> >> >> - a fonction to check whether a given element of the array is of >>> type T >>> >> >> rather than of NAtype (generalization of isna()) >>> >> >> - a fonction to check whether all elements of the array are of >>> type T >>> >> >> rather than of NAtype (generalization of anyna(), more efficient >>> than >>> >> >> calling the previous function on all elements thanks to the >>> packing of >>> >> >> NAs >>> >> >> in a BitArray) >>> >> >> In this scheme, what is missing is how to allow the compiler to >>> pack >>> >> >> NAs >>> >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit >>> >> >> object. >>> >> >> Maybe by making it an enum-like immutable with a 1-bit field >>> inside it? >>> >> >> >>> >> >> How does it sound? >>> >> > >>> >> > >>> >> > I've thought a bit about this, but it seems like it would be too >>> much >>> >> > complexity in the compiler. Storing arrays as something besides >>> >> > contiguous >>> >> > elements and interaction between the codegen in C/C++ and the >>> BitArray >>> >> > code >>> >> > in Julia both seem likely to be painful, although Jeff, Keno, and >>> >> > Jameson >>> >> > would know better than I. Additionally, this optimization (of >>> storage of >>> >> > arrays of unions of a singleton type and a bits type) seems pretty >>> >> > specific >>> >> > to DataArrays, but the actual advantages in terms of performance and >>> >> > expressibility would be small or non-existent. (This is in contrast >>> to >>> >> > optimizing storage/dispatch with union types, which could benefit a >>> lot >>> >> > of >>> >> > code and is something a lot of languages do.) Finally, there are >>> cases >>> >> > where >>> >> > it is useful to have direct access to the na BitArray chunks beyond >>> >> > anyna, >>> >> > e.g. pairwise summation and reductions across the first dimension. >>> >> > >>> >> > Simon >>> >> >> >> > >