What's the deal with unsafe_get? Does that segfault if you try to access a
null value reference value and give you random junk if you try to access an
bits value?


On Tue, Aug 12, 2014 at 1:09 AM, John Myles White <johnmyleswh...@gmail.com>
wrote:

> Ok, I’ve cleaned this up and renamed it to NullableTypes.jl:
> https://github.com/johnmyleswhite/NullableTypes.jl
>
>  — John
>
> On Aug 11, 2014, at 9:17 PM, Stefan Karpinski <ste...@karpinski.org>
> wrote:
>
> Sounds good. We could maybe include the Nullable type in Base and thus
> avoid the issue of what to call the module. If we need a module, how about
> Nullables? Although, I have to say that name makes me think of The
> Expendables.
>
> <expendables.png>
>
>
> On Tue, Aug 12, 2014 at 12:10 AM, John Myles White <
> johnmyleswh...@gmail.com> wrote:
>
>> Just wanted to come back to this thread now that I’m back from vacation.
>> It sounds like the consensus of Jeff, Stefan, Keno and Jameson is that
>> we’re better off working with an explicit Option{T} type than trying to get
>> the compiler to handle Union(S, T) more efficiently.
>>
>> If people are willing to accept that idea, I’d like to make the use of
>> Option’s a priority for a release of DataArrays that would accompany Julia
>> 0.4. There’s going to be a lot of changes to Julia’s core with that
>> release, so it seems like the perfect time to make some breaking changes to
>> JuliaData packages.
>>
>> After thinking about the type hierarchy more, I’ve come to really like
>> the interpretion of Option{T} as a 0-or-1 element container type. In that
>> interpretation, Option{T} is to T exactly as Array{T} is to T, which is a
>> relationship that has only horizontal links and no vertical links. I think
>> that perspective simplifies things a lot, whie articulating the core issues
>> involved with working with Option{T}.
>>
>> For me, the main question is whether we want Option types to be a feature
>> that’s shared by people who want to express NULL pointers (i.e. ontological
>> missingness) or whether we want to customize things for statistical
>> missingness (i.e. epistemological missingness). As it stands,
>> OptionTypes.jl implements such a bare-bones version of missingness that it
>> could be safely used for either purpose.
>>
>> As for the naming debate, I think NullableTypes with a new type called
>> Nullable{T} is the way to go.
>>
>>  — John
>>
>> On Aug 1, 2014, at 1:43 PM, Stefan Karpinski <ste...@karpinski.org>
>> wrote:
>>
>> That also strikes me as the best approach for what it's worth – just use
>> option/maybe/nullable for what you return when indexing into a DataArray
>> but keep the DataArray storage as two separate arrays.
>>
>>
>> On Fri, Aug 1, 2014 at 3:29 PM, Jeff Bezanson <jeff.bezan...@gmail.com>
>> wrote:
>>
>>> As usual, I agree with Keno :)
>>>
>>> We could also implement optimizations for Union(Bits,OtherBits). In
>>> theory this can be stack allocated along with a boolean flag that says
>>> which one it is. However to take full advantage of this it seems you
>>> need to generate lots of branches with code for each case. Possible
>>> but tricky.
>>>
>>> There is also a strong connection to the general
>>> array-of-structs-to-struct-of-arrays optimization. It would not be
>>> totally crazy to build this in to our object representation somehow,
>>> or add hooks allowing customization of data representations, like
>>> staged functions but for data instead of code.
>>>
>>>
>>> On Fri, Aug 1, 2014 at 2:54 PM, Jameson Nash <vtjn...@gmail.com> wrote:
>>> > I could (and probably will, someday) revive that commit. At the time,
>>> > though, I seemed to find that it provided little performance benefit
>>> -- the
>>> > gc cost of allocating boxes was far greater (for type uncertainty
>>> involving
>>> > bitstypes) and the type dispatch wasn't as much of a performance
>>> impact as I
>>> > had previously assumed.
>>> >
>>> >
>>> > On Friday, August 1, 2014, Keno Fischer <kfisc...@college.harvard.edu>
>>> > wrote:
>>> >>
>>> >> It is possible to do generic compiler improvements for Union types
>>> >> (Jameson had a branch at some point that did callsite splitting if we
>>> >> inferred a Union type). However, I think the best way to go here is to
>>> >> maintain the current separation of two arrays (one of the values one
>>> >> for the NAs), but give an option type on access. The option type would
>>> >> then most likely be in memory and wouldn't have much overhead. Please
>>> >> let me know if there's anything specific I should explain how the
>>> >> compiler will handle it, I admit I have only skimmed this thread.
>>> >>
>>> >> On Fri, Aug 1, 2014 at 9:18 AM, Simon Kornblith <si...@simonster.com>
>>> >> wrote:
>>> >> > On Friday, August 1, 2014 6:23:59 AM UTC-4, Milan Bouchet-Valat
>>> wrote:
>>> >> >>
>>> >> >> Le jeudi 31 juillet 2014 à 21:19 -0700, John Myles White a écrit :
>>> >> >>
>>> >> >> To address Simon’s general points, which are really good reasons to
>>> >> >> avoid
>>> >> >> jumping on the Option{T} bandwagon too soon:
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> * I agree that most languages use tagged union types for Option{T}
>>> >> >> rather
>>> >> >> than a wrapper type that contains a Boolean value. It’s also
>>> totally
>>> >> >> true
>>> >> >> that many compilers are able to make those constructs more
>>> efficient
>>> >> >> than
>>> >> >> Julia currently does. But what we should expect from Julia in the
>>> >> >> coming
>>> >> >> years isn’t so clear to me. (And I personally think we need to
>>> settle
>>> >> >> on a
>>> >> >> solution for representing missing data that’s viable in a year
>>> rather
>>> >> >> than
>>> >> >> viable in five years.) This is an issue that I’d really like to
>>> have
>>> >> >> input
>>> >> >> on from Jeff, Keno, Jameson or someone else involved with the
>>> internals
>>> >> >> of
>>> >> >> the compiler. Getting input from the broader community is the main
>>> >> >> reason I
>>> >> >> wanted to put a demo of OptionTypes.jl out in front of other folks.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> * I’m not clear how we could come to know that a datum is not
>>> missing
>>> >> >> without a resolution step that’s effectively equivalent to the
>>> get()
>>> >> >> function for Option{T}. I agree that the enforced use of get()
>>> means
>>> >> >> that
>>> >> >> you can’t hope to use generic functions like sum on collections of
>>> >> >> Option{T}. But I’m also not sure that’s such a bad thing: I think
>>> the
>>> >> >> easiest way to express to the compiler that you know that all of
>>> the
>>> >> >> entries
>>> >> >> of a DataArray are not NA is to convert the DataArray to a straight
>>> >> >> Array.
>>> >> >> But maybe you have other mechanisms for expressing this knowledge.
>>> >> >> Certainly
>>> >> >> my proposal to do conversions to Arrays isn’t the most elegant
>>> >> >> strategy.
>>> >> >> It’s just all that I’ve got so far.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> * I kind of like the idea of Option{T} standing outside of the main
>>> >> >> type
>>> >> >> system in a kind of mirror type system. I’m less happy about
>>> Union(NA,
>>> >> >> T)
>>> >> >> being a super type of T, even though there are some good reasons
>>> that
>>> >> >> you’d
>>> >> >> like to view T as a specialization of Union(NA, T). But I agree
>>> that I
>>> >> >> don’t
>>> >> >> have a good feel about where missing data belongs in the type
>>> >> >> hierarchy.
>>> >> >> This is another question for which I’d love to get input from
>>> others.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> In regard to Simon’s performance points:
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> * Yes, memory usage alone argues strongly for working with
>>> DataArray{T}
>>> >> >> rather than Array{Option{T}}.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> * Exploting tricks that make operations like anyna() faster is
>>> another
>>> >> >> good argument for keeping DataArray{T} around.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> * I’m not sure how to deal with inlining concerns or the undefined
>>> >> >> reference checks. Do you have ideas for improving this within
>>> >> >> DataArrays or
>>> >> >> do we need supporting changes in the compiler?
>>> >> >>
>>> >> >> Actually it seems it would be possible to make Array{Union(NAtype,
>>> T)}
>>> >> >> more similar to and as efficient as DataArray{T}, by handling a few
>>> >> >> things
>>> >> >> in the compiler. This would create a generalization of DataArray
>>> to any
>>> >> >> kind
>>> >> >> of union type, which could be useful in other contexts. But more
>>> >> >> importantly, it would make missing values integrate seamlessly into
>>> >> >> Julia,
>>> >> >> getting rid of any hacks.
>>> >> >>
>>> >> >> More specifically, the following features would need to be
>>> supported:
>>> >> >> - a way of telling the compiler to store the data as two arrays of
>>> >> >> concrete types (here T and NAtype), instead of as an array of boxed
>>> >> >> values,
>>> >> >> so that:
>>> >> >>     * efficient operations can be performed on the T values (by
>>> >> >> skipping
>>> >> >> the missing ones manually)
>>> >> >>     * T values are stored as a dense array and can be converted to
>>> >> >> Array{T} without any copy or passed to BLAS when no missing values
>>> are
>>> >> >> present
>>> >> >>    * NA values can be packed in a BitArray to save memory and make
>>> NA
>>> >> >> detection faster (see below)
>>> >> >> - a fonction to check whether a given element of the array is of
>>> type T
>>> >> >> rather than of NAtype (generalization of isna())
>>> >> >> - a fonction to check whether all elements of the array are of
>>> type T
>>> >> >> rather than of NAtype (generalization of anyna(), more efficient
>>> than
>>> >> >> calling the previous function on all elements thanks to the
>>> packing of
>>> >> >> NAs
>>> >> >> in a BitArray)
>>> >> >> In this scheme, what is missing is how to allow the compiler to
>>> pack
>>> >> >> NAs
>>> >> >> in a BitArray. Somehow, NAtype would have to be defined as a 1-bit
>>> >> >> object.
>>> >> >> Maybe by making it an enum-like immutable with a 1-bit field
>>> inside it?
>>> >> >>
>>> >> >> How does it sound?
>>> >> >
>>> >> >
>>> >> > I've thought a bit about this, but it seems like it would be too
>>> much
>>> >> > complexity in the compiler. Storing arrays as something besides
>>> >> > contiguous
>>> >> > elements and interaction between the codegen in C/C++ and the
>>> BitArray
>>> >> > code
>>> >> > in Julia both seem likely to be painful, although Jeff, Keno, and
>>> >> > Jameson
>>> >> > would know better than I. Additionally, this optimization (of
>>> storage of
>>> >> > arrays of unions of a singleton type and a bits type) seems pretty
>>> >> > specific
>>> >> > to DataArrays, but the actual advantages in terms of performance and
>>> >> > expressibility would be small or non-existent. (This is in contrast
>>> to
>>> >> > optimizing storage/dispatch with union types, which could benefit a
>>> lot
>>> >> > of
>>> >> > code and is something a lot of languages do.) Finally, there are
>>> cases
>>> >> > where
>>> >> > it is useful to have direct access to the na BitArray chunks beyond
>>> >> > anyna,
>>> >> > e.g. pairwise summation and reductions across the first dimension.
>>> >> >
>>> >> > Simon
>>>
>>
>>
>>
>
>

Reply via email to