I also haven't read this thread carefully, but it does seem that in this case one needs to automatically mutate code like this:
x = A[5] if isa(x, BadType) ... elseif isa(x, GoodType) ... # do something with x end into dt = peektype(A, 5) # gets the type of A[5] without evaluating it if dt <: BadType ... elseif dt <: GoodType x = A[5] ... # do something with x end so that dt can be type-stable. --Tim On Friday, August 01, 2014 02:54:47 PM Jameson Nash wrote: > I could (and probably will, someday) revive that commit. At the time, > though, I seemed to find that it provided little performance benefit -- the > gc cost of allocating boxes was far greater (for type uncertainty involving > bitstypes) and the type dispatch wasn't as much of a performance impact as > I had previously assumed. > > > On Friday, August 1, 2014, Keno Fischer <kfisc...@college.harvard.edu> > > wrote: > > It is possible to do generic compiler improvements for Union types > > (Jameson had a branch at some point that did callsite splitting if we > > inferred a Union type). However, I think the best way to go here is to > > maintain the current separation of two arrays (one of the values one > > for the NAs), but give an option type on access. The option type would > > then most likely be in memory and wouldn't have much overhead. Please > > let me know if there's anything specific I should explain how the > > compiler will handle it, I admit I have only skimmed this thread. > > > > On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <si...@simonster.com > > > > <javascript:;>> wrote: > > > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat wrote: > > >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit : > > >> > > >> To address Simon’s general points, which are really good reasons to > > > > avoid > > > > >> jumping on the Option{T} bandwagon too soon: > > >> > > >> > > >> > > >> * I agree that most languages use tagged union types for Option{T} > > > > rather > > > > >> than a wrapper type that contains a Boolean value. It’s also totally > > > > true > > > > >> that many compilers are able to make those constructs more efficient > > > > than > > > > >> Julia currently does. But what we should expect from Julia in the > > >> coming > > >> years isn’t so clear to me. (And I personally think we need to settle > > > > on a > > > > >> solution for representing missing data that’s viable in a year rather > > > > than > > > > >> viable in five years.) This is an issue that I’d really like to have > > > > input > > > > >> on from Jeff, Keno, Jameson or someone else involved with the internals > > > > of > > > > >> the compiler. Getting input from the broader community is the main > > > > reason I > > > > >> wanted to put a demo of OptionTypes.jl out in front of other folks. > > >> > > >> > > >> > > >> * I’m not clear how we could come to know that a datum is not missing > > >> without a resolution step that’s effectively equivalent to the get() > > >> function for Option{T}. I agree that the enforced use of get() means > > > > that > > > > >> you can’t hope to use generic functions like sum on collections of > > >> Option{T}. But I’m also not sure that’s such a bad thing: I think the > > >> easiest way to express to the compiler that you know that all of the > > > > entries > > > > >> of a DataArray are not NA is to convert the DataArray to a straight > > > > Array. > > > > >> But maybe you have other mechanisms for expressing this knowledge. > > > > Certainly > > > > >> my proposal to do conversions to Arrays isn’t the most elegant > > >> strategy. > > >> It’s just all that I’ve got so far. > > >> > > >> > > >> > > >> * I kind of like the idea of Option{T} standing outside of the main > > >> type > > >> system in a kind of mirror type system. I’m less happy about Union(NA, > > > > T) > > > > >> being a super type of T, even though there are some good reasons that > > > > you’d > > > > >> like to view T as a specialization of Union(NA, T). But I agree that I > > > > don’t > > > > >> have a good feel about where missing data belongs in the type > > >> hierarchy. > > >> This is another question for which I’d love to get input from others. > > >> > > >> > > >> > > >> In regard to Simon’s performance points: > > >> > > >> > > >> > > >> * Yes, memory usage alone argues strongly for working with DataArray{T} > > >> rather than Array{Option{T}}. > > >> > > >> > > >> > > >> * Exploting tricks that make operations like anyna() faster is another > > >> good argument for keeping DataArray{T} around. > > >> > > >> > > >> > > >> * I’m not sure how to deal with inlining concerns or the undefined > > >> reference checks. Do you have ideas for improving this within > > > > DataArrays or > > > > >> do we need supporting changes in the compiler? > > >> > > >> Actually it seems it would be possible to make Array{Union(NAtype, T)} > > >> more similar to and as efficient as DataArray{T}, by handling a few > > > > things > > > > >> in the compiler. This would create a generalization of DataArray to any > > > > kind > > > > >> of union type, which could be useful in other contexts. But more > > >> importantly, it would make missing values integrate seamlessly into > > > > Julia, > > > > >> getting rid of any hacks. > > >> > > >> More specifically, the following features would need to be supported: > > >> - a way of telling the compiler to store the data as two arrays of > > >> concrete types (here T and NAtype), instead of as an array of boxed > > > > values, > > > > >> so that: > > >> * efficient operations can be performed on the T values (by > > >> skipping > > >> > > >> the missing ones manually) > > >> > > >> * T values are stored as a dense array and can be converted to > > >> > > >> Array{T} without any copy or passed to BLAS when no missing values are > > >> present > > >> > > >> * NA values can be packed in a BitArray to save memory and make NA > > >> > > >> detection faster (see below) > > >> - a fonction to check whether a given element of the array is of type T > > >> rather than of NAtype (generalization of isna()) > > >> - a fonction to check whether all elements of the array are of type T > > >> rather than of NAtype (generalization of anyna(), more efficient than > > >> calling the previous function on all elements thanks to the packing of > > > > NAs > > > > >> in a BitArray) > > >> In this scheme, what is missing is how to allow the compiler to pack > > >> NAs > > >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit > > > > object. > > > > >> Maybe by making it an enum-like immutable with a 1-bit field inside it? > > >> > > >> How does it sound? > > > > > > I've thought a bit about this, but it seems like it would be too much > > > complexity in the compiler. Storing arrays as something besides > > > > contiguous > > > > > elements and interaction between the codegen in C/C++ and the BitArray > > > > code > > > > > in Julia both seem likely to be painful, although Jeff, Keno, and > > > Jameson > > > would know better than I. Additionally, this optimization (of storage of > > > arrays of unions of a singleton type and a bits type) seems pretty > > > > specific > > > > > to DataArrays, but the actual advantages in terms of performance and > > > expressibility would be small or non-existent. (This is in contrast to > > > optimizing storage/dispatch with union types, which could benefit a lot > > > > of > > > > > code and is something a lot of languages do.) Finally, there are cases > > > > where > > > > > it is useful to have direct access to the na BitArray chunks beyond > > > > anyna, > > > > > e.g. pairwise summation and reductions across the first dimension. > > > > > > Simon