Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :

> To address Simon’s general points, which are really good reasons to
> avoid jumping on the Option{T} bandwagon too soon:
> 
> 
> 
> * I agree that most languages use tagged union types for Option{T}
> rather than a wrapper type that contains a Boolean value. It’s also
> totally true that many compilers are able to make those constructs
> more efficient than Julia currently does. But what we should expect
> from Julia in the coming years isn’t so clear to me. (And I personally
> think we need to settle on a solution for representing missing data
> that’s viable in a year rather than viable in five years.) This is an
> issue that I’d really like to have input on from Jeff, Keno, Jameson
> or someone else involved with the internals of the compiler. Getting
> input from the broader community is the main reason I wanted to put a
> demo of OptionTypes.jl out in front of other folks.
> 
> 
> * I’m not clear how we could come to know that a datum is not missing
> without a resolution step that’s effectively equivalent to the get()
> function for Option{T}. I agree that the enforced use of get() means
> that you can’t hope to use generic functions like sum on collections
> of Option{T}. But I’m also not sure that’s such a bad thing: I think
> the easiest way to express to the compiler that you know that all of
> the entries of a DataArray are not NA is to convert the DataArray to a
> straight Array. But maybe you have other mechanisms for expressing
> this knowledge. Certainly my proposal to do conversions to Arrays
> isn’t the most elegant strategy. It’s just all that I’ve got so far.
> 
> 
> * I kind of like the idea of Option{T} standing outside of the main
> type system in a kind of mirror type system. I’m less happy about
> Union(NA, T) being a super type of T, even though there are some good
> reasons that you’d like to view T as a specialization of Union(NA, T).
> But I agree that I don’t have a good feel about where missing data
> belongs in the type hierarchy. This is another question for which I’d
> love to get input from others.

Either with Option{T} or with Union(NAtype, T), nullable types create a
mirror type hierarchy. But in the latter case, there are bridges between
the two hierarchies:
- horizontally, T <: Union(NAtype, T)
- vertically (in diagonal), T2 <: Union(NAtype, T1) when T2 <: T1

For example, for Real and Float64:


Real    <- Union(NAtype, Real)
 |        /    |
 |      /      |
 |    /        |
Float64 <- Union(NAtype, Float64)


I see this as a relatively nice definition of the type hierarchy for
missing data.

Thus I agree with Simon that using a type union if possible would be
better than a specific option type which would obscure the type
hierarchy -- which is really essential in Julia. Let us see what people
working on the compiler say.

But thanks for tackling this John, it's a very interesting investigation
and it would be great to be able to handle missing data in a more Julian
fashion!


Regards


> 
> In regard to Simon’s performance points:
> 
> 
> * Yes, memory usage alone argues strongly for working with
> DataArray{T} rather than Array{Option{T}}.
> 
> 
> * Exploting tricks that make operations like anyna() faster is another
> good argument for keeping DataArray{T} around.
> 
> 
> * I’m not sure how to deal with inlining concerns or the undefined
> reference checks. Do you have ideas for improving this within
> DataArrays or do we need supporting changes in the compiler?
> 
> 
>  — John
> 
> 
> On Jul 31, 2014, at 4:49 PM, Simon Kornblith <si...@simonster.com>
> wrote:
> 
> 
> 
> > My previous comments on option types versus union types for
> > DataArray indexing are here. To elaborate on John's summary above,
> > my present views are:
> >       * In many other language, including languages that are
> >         lower-level than Julia such as Rust, option types are union
> >         types and it's the compiler's job to make them fast. Making
> >         union types fast would simply mean that the presence
> >         of Union in the output of code_typed is no longer to be
> >         dreaded.
> >       * The option type approach forces special handling of data
> >         that could potentially be missing but isn't, and this breaks
> >         generic code. I feel it is better if, once we know a datum
> >         is not missing, it has the same concrete type as if it were
> >         known to exist in advance.
> >       * I'm not sure how missing data ought to fit into the type
> >         hierarchy, but neither the Union(NA, T) nor Option{T}
> >         approaches seem completely ideal.
> > Some more specific notes WRT performance:
> >       * Storing option types in an Array{Option{T}} would require
> >         nearly twice as much memory as storing them in a DataArray
> >         in most cases. While a Bool is only 1 byte, there is
> >         additional padding needed for alignment. On x86_64 for types
> >         <= 64 bits the alignment is usually the size of the type.
> >       * As long as option types live in registers and not in memory,
> >         they should consume no additional space as compared with
> >         scalar indexing into both the data and na arrays of a
> >         DataArray, as indexing into the na array would already
> >         require a register. There are, however, some cases where the
> >         BitArray representation of NAs can be exploited for
> >         performance. In John's example code forsum above with dropna
> >         = false above, the Option type approach would read every bit
> >         of the naarray individually. It is faster to first check if
> >         any values are NA, which can be done 64 bits at a time, and
> >         throw an error if there are NAs or sum the values if not.
> >       * There is other overhead related to DataArray indexing that
> >         needs to be addressed, notably that scalar indexing needs to
> >         be inlined (as well as isna and get, if we're using option
> >         types), and that accessing the data and na arrays currently
> >         incurs an undefined reference check.
> > Simon
> > 
> > On Thursday, July 31, 2014 12:51:15 PM UTC-4, John Myles White
> > wrote:
> > 
> >         Harlan, I don't think your assumption about performance is
> >         necessarily correct: it's sometimes the case that you can
> >         get *better* performance by working with bytes rather than
> >         bits, since bytes are the primitive objects of computation.
> >         For that reason, I think the performance implications of
> >         using a full Bool are very sensitive to the exact
> >         computations being performed.
> >         
> >         In general, I don't think the expansion of a BitArray bit
> >         into a Boolean is a big issue for most data analysis tasks.
> >         As evidence, I'd note that expanding a bit to a Bool is
> >         certainly not worse than the cost of translating a single
> >         byte from an ASCIIString or UTF8String object into a
> >         Char object when doing iteration over strings. If you think
> >         that decision for Base Julia is reasonable, I think the
> >         decision to use Option is also defensible for similar
> >         reasons.
> >         
> >         As for integration into the current system, my thinking is
> >         that DataArrays would be changed so that OptionTypes would
> >         be generated whenever you attempt to access a scalar element
> >         of an AbstractDataArray. The following example shows how
> >         that might work:
> >         
> >         function mean(a::AbstractDataArray, dropna::Boolean = false)
> >            sum, n = 0.0, 0
> >            if dropna
> >                for i in 1:length(a)
> >                    o = a[i]
> >                    if !isna(o)
> >                        sum += get(o)
> >                        n += 1
> >                    end
> >                end
> >            else
> >                for i in 1:length(a)
> >                    sum += get(o)
> >                    n += 1
> >                end
> >            end
> >            return sum / n
> >         end
> >         
> >         One could also define this function so that it always
> >         returns an Option type, rather than a direct Float64 value.
> >         
> >         
> >         For special types like Float64, you could even computes
> >         means while returning NaN's using the default values
> >         interface to `get`:
> >         
> >         function mean{T <: FloatingPoint}(a::AbstractDataArray{T})
> >            sum, n = 0.0, 0
> >            for i in 1:length(a)
> >                sum += get(o, nan(T))
> >                n += 1
> >            end
> >            return sum / n
> >         end
> >         
> >         Unlike our current system, the use of OptionTypes would
> >         provide acceptable performance without requiring that users
> >         break abstraction barriers. This is the big gain: Option{T}
> >         exploits Julia's current type system / compiler to express
> >         the same ideas as NAtype does in an efficient way. Options
> >         wouldn't need to be boxed the way that our Union types
> >         currently are being boxed because their type could be easily
> >         inferred by the compiler.
> >         
> >         If you're not familiar with the requirement for breaking our
> >         current abstraction barriers, note that indexing into a
> >         DataArray at the moment poisons the performance of every
> >         program you write, because the result has an uncertain type
> >         that the compiler doesn't know how to optimize. To work
> >         around this, you have to write code that effectively
> >         accesses the raw .na and .data fields of a DataArray.
> >         Simon's done some great work to make this easier to do, but
> >         I'm not sure that's the right direction for us to head in
> >         the long-run.
> >         
> >         In general, I think an approach to missing data built on the
> >         combination of Option{T} and DataArray{T} provides an
> >         interface that's simple and consistent (everything happens
> >         in terms of isna/get) even if it's an interface that's
> >         somewhat unfamilar to R/Python folks. Most important to me
> >         is that using Option types efficiently doesn't require a
> >         deep understanding of Julia's type system, whereas our
> >         current abstractions require you to understand how to work
> >         around problems raised by Union types when programming for
> >         the current compiler. An Option type is basically a forcing
> >         function that says: "Julia is aggressively typed. If you
> >         want to work with a missing value of type T, you need to
> >         explicitly say how you're going to handle any missingness so
> >         that the system only interacts with values of type T."
> >         
> >         I should note that I'm not very sure the use of Options is
> >         the right approach: Simon Kornblith has argued very
> >         persuasively for waiting for the compiler to improve its
> >         ability to handle tagged unions like those generated by
> >         indexing into DataArrays. My personal feeling is that it's
> >         easier to expose a simple abstraction that doesn't assume
> >         the compiler will change dramatically. This is based on my
> >         aesthetic sense that Julia's power comes from making
> >         optimizations transparent and explicit, rather than making
> >         the compiler smarter.
> >         
> >         It's also worth noting that there are problems for which the
> >         use of Option types isn't very helpful: the computation of
> >         medians, for example, isn't defined in terms of scalars, so
> >         having a better abstraction for expressing missing scalars
> >         won't get us anywhere.
> >         
> >         One other caveat is that I'm not providing an `unsafe_get`
> >         method, which means that the `isna` followed by `get` idiom
> >         I showed above does two checks when you could get away with
> >         only one. I still haven't figured out how I'd like to handle
> >         that issue.
> >         
> >         
> >         So there's still a lot of design work needed, but I wanted
> >         to let people see how this interface could work were we to
> >         choose it.
> >         
> >         
> >          -- John
> >         
> >         
> >         On Jul 31, 2014, at 8:15 AM, Harlan Harris
> >         <har...@harris.name> wrote:
> >         
> >         
> >         
> >         > John, how might this interact with DataArrays? This
> >         > design, unlike DataArrays, requires that you use an entire
> >         > byte to store the missingness, so it's not likely to be as
> >         > fast if you're manipulating a lot of them. Is the intended
> >         > use case here something sum(DataArray) yields
> >         > Option{Float64}? 
> >         > 
> >         > 
> >         > 
> >         > On Thu, Jul 31, 2014 at 11:01 AM, Stefan
> >         > Karpinski <ste...@karpinski.org> wrote:
> >         > 
> >         >         This looks quite promising. The `get` interface
> >         >         looks like a very nice way to generically deal
> >         >         with missing values – provide a default or get an
> >         >         error. Automatic conversion of the default to the
> >         >         type of the Option value is particularly nice.
> >         >         This seems like it will be a pleasant and
> >         >         efficient API.
> >         >         
> >         >         
> >         >         
> >         >         Minor question: how come the non-NA constructor
> >         >         for Option takes both the `na` and `value`
> >         >         arguments? Doesn't supplying a value imply that
> >         >         it's not NA while not supplying a value implies
> >         >         that it is NA?
> >         >         
> >         >         
> >         >         
> >         >         On Thursday, July 31, 2014, John Myles White
> >         >         <johnmyl...@gmail.com> wrote:
> >         >         
> >         >                 At JuliaCon, I described the idea of using
> >         >                 Option types instead of NAtype to make it
> >         >                 easier to write type-stable code that
> >         >                 interacts with NA’s. To help facilitate a
> >         >                 debate about the utility of this approach,
> >         >                 I just wrote up a minimal package that
> >         >                 implements Option
> >         >                 types: 
> > https://github.com/johnmyleswhite/OptionTypes.jl
> >         >                 
> >         >                 Essentially, an Option type is a scalar
> >         >                 version of a DataArray: it contains a
> >         >                 value and a Boolean mask indicating
> >         >                 whether that value is NA or not. Because
> >         >                 Option types are parametric, they allow us
> >         >                 to express the variants of NA that R has,
> >         >                 such as a Float64 specific NA and an Int64
> >         >                 specific NA.
> >         >                 
> >         >                  — John
> 
> 
> 

Reply via email to