Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit : > To address Simon’s general points, which are really good reasons to > avoid jumping on the Option{T} bandwagon too soon: > > > > * I agree that most languages use tagged union types for Option{T} > rather than a wrapper type that contains a Boolean value. It’s also > totally true that many compilers are able to make those constructs > more efficient than Julia currently does. But what we should expect > from Julia in the coming years isn’t so clear to me. (And I personally > think we need to settle on a solution for representing missing data > that’s viable in a year rather than viable in five years.) This is an > issue that I’d really like to have input on from Jeff, Keno, Jameson > or someone else involved with the internals of the compiler. Getting > input from the broader community is the main reason I wanted to put a > demo of OptionTypes.jl out in front of other folks. > > > * I’m not clear how we could come to know that a datum is not missing > without a resolution step that’s effectively equivalent to the get() > function for Option{T}. I agree that the enforced use of get() means > that you can’t hope to use generic functions like sum on collections > of Option{T}. But I’m also not sure that’s such a bad thing: I think > the easiest way to express to the compiler that you know that all of > the entries of a DataArray are not NA is to convert the DataArray to a > straight Array. But maybe you have other mechanisms for expressing > this knowledge. Certainly my proposal to do conversions to Arrays > isn’t the most elegant strategy. It’s just all that I’ve got so far. > > > * I kind of like the idea of Option{T} standing outside of the main > type system in a kind of mirror type system. I’m less happy about > Union(NA, T) being a super type of T, even though there are some good > reasons that you’d like to view T as a specialization of Union(NA, T). > But I agree that I don’t have a good feel about where missing data > belongs in the type hierarchy. This is another question for which I’d > love to get input from others.
Either with Option{T} or with Union(NAtype, T), nullable types create a mirror type hierarchy. But in the latter case, there are bridges between the two hierarchies: - horizontally, T <: Union(NAtype, T) - vertically (in diagonal), T2 <: Union(NAtype, T1) when T2 <: T1 For example, for Real and Float64: Real <- Union(NAtype, Real) | / | | / | | / | Float64 <- Union(NAtype, Float64) I see this as a relatively nice definition of the type hierarchy for missing data. Thus I agree with Simon that using a type union if possible would be better than a specific option type which would obscure the type hierarchy -- which is really essential in Julia. Let us see what people working on the compiler say. But thanks for tackling this John, it's a very interesting investigation and it would be great to be able to handle missing data in a more Julian fashion! Regards > > In regard to Simon’s performance points: > > > * Yes, memory usage alone argues strongly for working with > DataArray{T} rather than Array{Option{T}}. > > > * Exploting tricks that make operations like anyna() faster is another > good argument for keeping DataArray{T} around. > > > * I’m not sure how to deal with inlining concerns or the undefined > reference checks. Do you have ideas for improving this within > DataArrays or do we need supporting changes in the compiler? > > > — John > > > On Jul 31, 2014, at 4:49 PM, Simon Kornblith <si...@simonster.com> > wrote: > > > > > My previous comments on option types versus union types for > > DataArray indexing are here. To elaborate on John's summary above, > > my present views are: > > * In many other language, including languages that are > > lower-level than Julia such as Rust, option types are union > > types and it's the compiler's job to make them fast. Making > > union types fast would simply mean that the presence > > of Union in the output of code_typed is no longer to be > > dreaded. > > * The option type approach forces special handling of data > > that could potentially be missing but isn't, and this breaks > > generic code. I feel it is better if, once we know a datum > > is not missing, it has the same concrete type as if it were > > known to exist in advance. > > * I'm not sure how missing data ought to fit into the type > > hierarchy, but neither the Union(NA, T) nor Option{T} > > approaches seem completely ideal. > > Some more specific notes WRT performance: > > * Storing option types in an Array{Option{T}} would require > > nearly twice as much memory as storing them in a DataArray > > in most cases. While a Bool is only 1 byte, there is > > additional padding needed for alignment. On x86_64 for types > > <= 64 bits the alignment is usually the size of the type. > > * As long as option types live in registers and not in memory, > > they should consume no additional space as compared with > > scalar indexing into both the data and na arrays of a > > DataArray, as indexing into the na array would already > > require a register. There are, however, some cases where the > > BitArray representation of NAs can be exploited for > > performance. In John's example code forsum above with dropna > > = false above, the Option type approach would read every bit > > of the naarray individually. It is faster to first check if > > any values are NA, which can be done 64 bits at a time, and > > throw an error if there are NAs or sum the values if not. > > * There is other overhead related to DataArray indexing that > > needs to be addressed, notably that scalar indexing needs to > > be inlined (as well as isna and get, if we're using option > > types), and that accessing the data and na arrays currently > > incurs an undefined reference check. > > Simon > > > > On Thursday, July 31, 2014 12:51:15 PM UTC-4, John Myles White > > wrote: > > > > Harlan, I don't think your assumption about performance is > > necessarily correct: it's sometimes the case that you can > > get *better* performance by working with bytes rather than > > bits, since bytes are the primitive objects of computation. > > For that reason, I think the performance implications of > > using a full Bool are very sensitive to the exact > > computations being performed. > > > > In general, I don't think the expansion of a BitArray bit > > into a Boolean is a big issue for most data analysis tasks. > > As evidence, I'd note that expanding a bit to a Bool is > > certainly not worse than the cost of translating a single > > byte from an ASCIIString or UTF8String object into a > > Char object when doing iteration over strings. If you think > > that decision for Base Julia is reasonable, I think the > > decision to use Option is also defensible for similar > > reasons. > > > > As for integration into the current system, my thinking is > > that DataArrays would be changed so that OptionTypes would > > be generated whenever you attempt to access a scalar element > > of an AbstractDataArray. The following example shows how > > that might work: > > > > function mean(a::AbstractDataArray, dropna::Boolean = false) > > sum, n = 0.0, 0 > > if dropna > > for i in 1:length(a) > > o = a[i] > > if !isna(o) > > sum += get(o) > > n += 1 > > end > > end > > else > > for i in 1:length(a) > > sum += get(o) > > n += 1 > > end > > end > > return sum / n > > end > > > > One could also define this function so that it always > > returns an Option type, rather than a direct Float64 value. > > > > > > For special types like Float64, you could even computes > > means while returning NaN's using the default values > > interface to `get`: > > > > function mean{T <: FloatingPoint}(a::AbstractDataArray{T}) > > sum, n = 0.0, 0 > > for i in 1:length(a) > > sum += get(o, nan(T)) > > n += 1 > > end > > return sum / n > > end > > > > Unlike our current system, the use of OptionTypes would > > provide acceptable performance without requiring that users > > break abstraction barriers. This is the big gain: Option{T} > > exploits Julia's current type system / compiler to express > > the same ideas as NAtype does in an efficient way. Options > > wouldn't need to be boxed the way that our Union types > > currently are being boxed because their type could be easily > > inferred by the compiler. > > > > If you're not familiar with the requirement for breaking our > > current abstraction barriers, note that indexing into a > > DataArray at the moment poisons the performance of every > > program you write, because the result has an uncertain type > > that the compiler doesn't know how to optimize. To work > > around this, you have to write code that effectively > > accesses the raw .na and .data fields of a DataArray. > > Simon's done some great work to make this easier to do, but > > I'm not sure that's the right direction for us to head in > > the long-run. > > > > In general, I think an approach to missing data built on the > > combination of Option{T} and DataArray{T} provides an > > interface that's simple and consistent (everything happens > > in terms of isna/get) even if it's an interface that's > > somewhat unfamilar to R/Python folks. Most important to me > > is that using Option types efficiently doesn't require a > > deep understanding of Julia's type system, whereas our > > current abstractions require you to understand how to work > > around problems raised by Union types when programming for > > the current compiler. An Option type is basically a forcing > > function that says: "Julia is aggressively typed. If you > > want to work with a missing value of type T, you need to > > explicitly say how you're going to handle any missingness so > > that the system only interacts with values of type T." > > > > I should note that I'm not very sure the use of Options is > > the right approach: Simon Kornblith has argued very > > persuasively for waiting for the compiler to improve its > > ability to handle tagged unions like those generated by > > indexing into DataArrays. My personal feeling is that it's > > easier to expose a simple abstraction that doesn't assume > > the compiler will change dramatically. This is based on my > > aesthetic sense that Julia's power comes from making > > optimizations transparent and explicit, rather than making > > the compiler smarter. > > > > It's also worth noting that there are problems for which the > > use of Option types isn't very helpful: the computation of > > medians, for example, isn't defined in terms of scalars, so > > having a better abstraction for expressing missing scalars > > won't get us anywhere. > > > > One other caveat is that I'm not providing an `unsafe_get` > > method, which means that the `isna` followed by `get` idiom > > I showed above does two checks when you could get away with > > only one. I still haven't figured out how I'd like to handle > > that issue. > > > > > > So there's still a lot of design work needed, but I wanted > > to let people see how this interface could work were we to > > choose it. > > > > > > -- John > > > > > > On Jul 31, 2014, at 8:15 AM, Harlan Harris > > <har...@harris.name> wrote: > > > > > > > > > John, how might this interact with DataArrays? This > > > design, unlike DataArrays, requires that you use an entire > > > byte to store the missingness, so it's not likely to be as > > > fast if you're manipulating a lot of them. Is the intended > > > use case here something sum(DataArray) yields > > > Option{Float64}? > > > > > > > > > > > > On Thu, Jul 31, 2014 at 11:01 AM, Stefan > > > Karpinski <ste...@karpinski.org> wrote: > > > > > > This looks quite promising. The `get` interface > > > looks like a very nice way to generically deal > > > with missing values – provide a default or get an > > > error. Automatic conversion of the default to the > > > type of the Option value is particularly nice. > > > This seems like it will be a pleasant and > > > efficient API. > > > > > > > > > > > > Minor question: how come the non-NA constructor > > > for Option takes both the `na` and `value` > > > arguments? Doesn't supplying a value imply that > > > it's not NA while not supplying a value implies > > > that it is NA? > > > > > > > > > > > > On Thursday, July 31, 2014, John Myles White > > > <johnmyl...@gmail.com> wrote: > > > > > > At JuliaCon, I described the idea of using > > > Option types instead of NAtype to make it > > > easier to write type-stable code that > > > interacts with NA’s. To help facilitate a > > > debate about the utility of this approach, > > > I just wrote up a minimal package that > > > implements Option > > > types: > > https://github.com/johnmyleswhite/OptionTypes.jl > > > > > > Essentially, an Option type is a scalar > > > version of a DataArray: it contains a > > > value and a Boolean mask indicating > > > whether that value is NA or not. Because > > > Option types are parametric, they allow us > > > to express the variants of NA that R has, > > > such as a Float64 specific NA and an Int64 > > > specific NA. > > > > > > — John > > >