I could (and probably will, someday) revive that commit. At the time, though, I seemed to find that it provided little performance benefit -- the gc cost of allocating boxes was far greater (for type uncertainty involving bitstypes) and the type dispatch wasn't as much of a performance impact as I had previously assumed.
On Friday, August 1, 2014, Keno Fischer <kfisc...@college.harvard.edu> wrote: > It is possible to do generic compiler improvements for Union types > (Jameson had a branch at some point that did callsite splitting if we > inferred a Union type). However, I think the best way to go here is to > maintain the current separation of two arrays (one of the values one > for the NAs), but give an option type on access. The option type would > then most likely be in memory and wouldn't have much overhead. Please > let me know if there's anything specific I should explain how the > compiler will handle it, I admit I have only skimmed this thread. > > On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <si...@simonster.com > <javascript:;>> wrote: > > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote: > >> > >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit : > >> > >> To address Simon’s general points, which are really good reasons to > avoid > >> jumping on the Option{T} bandwagon too soon: > >> > >> > >> > >> * I agree that most languages use tagged union types for Option{T} > rather > >> than a wrapper type that contains a Boolean value. It’s also totally > true > >> that many compilers are able to make those constructs more efficient > than > >> Julia currently does. But what we should expect from Julia in the coming > >> years isn’t so clear to me. (And I personally think we need to settle > on a > >> solution for representing missing data that’s viable in a year rather > than > >> viable in five years.) This is an issue that I’d really like to have > input > >> on from Jeff, Keno, Jameson or someone else involved with the internals > of > >> the compiler. Getting input from the broader community is the main > reason I > >> wanted to put a demo of OptionTypes.jl out in front of other folks. > >> > >> > >> > >> * I’m not clear how we could come to know that a datum is not missing > >> without a resolution step that’s effectively equivalent to the get() > >> function for Option{T}. I agree that the enforced use of get() means > that > >> you can’t hope to use generic functions like sum on collections of > >> Option{T}. But I’m also not sure that’s such a bad thing: I think the > >> easiest way to express to the compiler that you know that all of the > entries > >> of a DataArray are not NA is to convert the DataArray to a straight > Array. > >> But maybe you have other mechanisms for expressing this knowledge. > Certainly > >> my proposal to do conversions to Arrays isn’t the most elegant strategy. > >> It’s just all that I’ve got so far. > >> > >> > >> > >> * I kind of like the idea of Option{T} standing outside of the main type > >> system in a kind of mirror type system. I’m less happy about Union(NA, > T) > >> being a super type of T, even though there are some good reasons that > you’d > >> like to view T as a specialization of Union(NA, T). But I agree that I > don’t > >> have a good feel about where missing data belongs in the type hierarchy. > >> This is another question for which I’d love to get input from others. > >> > >> > >> > >> In regard to Simon’s performance points: > >> > >> > >> > >> * Yes, memory usage alone argues strongly for working with DataArray{T} > >> rather than Array{Option{T}}. > >> > >> > >> > >> * Exploting tricks that make operations like anyna() faster is another > >> good argument for keeping DataArray{T} around. > >> > >> > >> > >> * I’m not sure how to deal with inlining concerns or the undefined > >> reference checks. Do you have ideas for improving this within > DataArrays or > >> do we need supporting changes in the compiler? > >> > >> Actually it seems it would be possible to make Array{Union(NAtype, T)} > >> more similar to and as efficient as DataArray{T}, by handling a few > things > >> in the compiler. This would create a generalization of DataArray to any > kind > >> of union type, which could be useful in other contexts. But more > >> importantly, it would make missing values integrate seamlessly into > Julia, > >> getting rid of any hacks. > >> > >> More specifically, the following features would need to be supported: > >> - a way of telling the compiler to store the data as two arrays of > >> concrete types (here T and NAtype), instead of as an array of boxed > values, > >> so that: > >> * efficient operations can be performed on the T values (by skipping > >> the missing ones manually) > >> * T values are stored as a dense array and can be converted to > >> Array{T} without any copy or passed to BLAS when no missing values are > >> present > >> * NA values can be packed in a BitArray to save memory and make NA > >> detection faster (see below) > >> - a fonction to check whether a given element of the array is of type T > >> rather than of NAtype (generalization of isna()) > >> - a fonction to check whether all elements of the array are of type T > >> rather than of NAtype (generalization of anyna(), more efficient than > >> calling the previous function on all elements thanks to the packing of > NAs > >> in a BitArray) > >> In this scheme, what is missing is how to allow the compiler to pack NAs > >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit > object. > >> Maybe by making it an enum-like immutable with a 1-bit field inside it? > >> > >> How does it sound? > > > > > > I've thought a bit about this, but it seems like it would be too much > > complexity in the compiler. Storing arrays as something besides > contiguous > > elements and interaction between the codegen in C/C++ and the BitArray > code > > in Julia both seem likely to be painful, although Jeff, Keno, and Jameson > > would know better than I. Additionally, this optimization (of storage of > > arrays of unions of a singleton type and a bits type) seems pretty > specific > > to DataArrays, but the actual advantages in terms of performance and > > expressibility would be small or non-existent. (This is in contrast to > > optimizing storage/dispatch with union types, which could benefit a lot > of > > code and is something a lot of languages do.) Finally, there are cases > where > > it is useful to have direct access to the na BitArray chunks beyond > anyna, > > e.g. pairwise summation and reductions across the first dimension. > > > > Simon >