Re: [Numpy-discussion] Setting custom dtypes and 1.14

Allan Haldane Tue, 30 Jan 2018 09:29:20 -0800

On 01/29/2018 11:50 PM, [email protected] wrote:

On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <[email protected]<mailto:[email protected]>> wrote:


    On 01/29/2018 05:59 PM, [email protected] wrote:



        On Mon, Jan 29, 2018 at 5:50 PM, <[email protected]> wrote:



             On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane
             <[email protected]> wrote:

                 On 01/29/2018 04:02 PM, [email protected] wrote:
                 >
                 >
                 > On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root
                 > <[email protected]> wrote:
                 >
                 >     I <3 structured arrays. I love the fact that I can access data by
                 >     row and then by fieldname, or vice versa. There are times when I
                 >     need to pass just a column into a function, and there are times when
                 >     I need to process things row by row. Yes, pandas is nice if you want
                 >     the specialized indexing features, but it becomes a bear to deal
                 >     with if all you want is normal indexing, or even the ability to
                 >     easily loop over the dataset.
                 >
                 >
                 > I don't think there is a doubt that structured arrays, arrays with
                 > structured dtypes, are a useful container. The question is whether they
                 > should be more or the foundation for more.
                 >
                 > For example, computing a mean, or reduce operation, over numeric element
                 > ("columns"). Before padded views it was possible to index by selecting
                 > the relevant "columns" and view them as standard array. With padded
                 > views that breaks and AFAICS, there is no way in numpy 1.14.0 to compute
                 > a mean of some "columns". (I don't have numpy 1.14 to try or find a
                 > workaround, like maybe looping over all relevant columns.)
                 >
                 > Josef

                 Just to clarify, structured types have always had padding bytes,
                 that isn't new.

                 What *is* new (which we are pushing to 1.15, I think) is that it
                 may be somewhat more common to end up with padding than before, and
                 only if you are specifically using multi-field indexing, which is a
                 fairly specialized case.

                 I think recfunctions already account properly for padding bytes.
                 Except for the bug in #8100, which we will fix, padding-bytes in
                 recarrays are more or less invisible to a non-expert who only cares
                 about dataframe-like behavior.

                 In other words, padding is no obstacle at all to computing a
                 mean over a column, and single-field indexes in 1.15 behave
                 identically as before. The only thing that will change in 1.15 is
                 multi-field indexing, and it has never been possible to compute a
                 mean (or any binary operation) on multiple fields.


             from the example in the other thread
             a[['b', 'c']].view(('f8', 2)).mean(0)


             (from the statsmodels usecase:
             read csv with genfromtxt to get recarray or structured array
             select/index the numeric columns
             view them as standard array
             do whatever we can do with standard numpy arrays
             )


    Oh ok, I misunderstood. I see your point: a mean over fields is more
    difficult than before.

        Or, to phrase it as a question:

        How do we get a standard array with homogeneous dtype from the
        corresponding elements of a structured dtype in numpy 1.14.0?

        Josef


    The answer may be that "numpy has never had a way to do that",
    even if in a few special cases you might hack a workaround using views.

    That's what your example seems like to me. It uses an explicit view,
    which is an "expert" feature since views depend on the exact memory
    layout and binary representation of the array. Your example only
    works if the two fields have exactly the same dtype as each other
    and as the final dtype, and evidently breaks if there is byte
    padding for any reason.

    Pandas can do row means without these problems:

         >>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)

    Numpy is missing this functionality, so you or whoever wrote that
    example figured out a fragile workaround using views.

Once upon a time (*) this wasn't fragile but the only and recommended
way. Because dtypes were low level with clear memory layout and stayed
that way, it was easy to check item size or whatever and get different
views on it.
e.g.
https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html

(*) pre-pandas, pre-stackoverflow on the mailing lists which was for me
roughly 2008 to 2012
but a late thread
https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html
"What is now the recommended way of converting structured
dtypes/recarrays to ndarrays?"





    I suggest that if we want to allow either means over fields, or
    conversion of a n-D structured array to an n+1-D regular ndarray, we
    should add a dedicated function to do so in numpy.lib.recfunctions
    which does not depend on the binary representation of the array.
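
A rough sketch of what such a function might look like. The name
fields_to_ndarray is hypothetical (it is not an existing recfunctions
function), and the sketch assumes simple, non-nested numeric fields. It
copies each field by name instead of reinterpreting memory, so padding
and mixed field dtypes do not matter:

```python
import numpy as np


def fields_to_ndarray(arr, names=None, dtype=float):
    """Hypothetical helper: stack named fields along a new last axis.

    An n-D structured array becomes an (n+1)-D regular ndarray.
    Each field is accessed by name and cast to `dtype`, so the result
    does not depend on offsets, padding, or the fields' own dtypes.
    """
    if names is None:
        names = arr.dtype.names
    columns = [arr[name].astype(dtype) for name in names]
    return np.stack(columns, axis=-1)


# Same data as the pandas example: one int64 field and one float64 field.
data = np.ones(10, dtype='i8,f8')
dense = fields_to_ndarray(data)

# Mean of each field, computed on a plain 2-D float array.
field_means = dense.mean(axis=0)
```

The trade-off is that this always makes a copy, whereas the view-based
idiom shared memory with the original array.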


I don't really want to defend an obsolete (?) usecase of structured dtypes.

However, I think there should be a decision about the future plans for
whether dataframe-like usages of structured dtypes, directly or through
higher level classes or functions, are still supported. Instead of
removing slowly and silently (*) the foundation for this use case,
either support this usage or say you will be dropping it.


(*) I didn't read the details of the release notes


And another footnote about obsolete:

Given that I'm the only one arguing about the dataframe_like usecase of
recarrays and structured dtypes, I think they are dead for this specific
usecase and only my inertia and conservativeness kept them alive in
statsmodels.



Josef

It's a bit of a stretch to say that we are "silently" dropping support
for dataframe-like use of structured arrays.

First, we still allow pretty much all dataframe-like use we have
supported since numpy 1.7, limited as it may be. We are really only
dropping one very specialized, expert use involving an explicit view,
which I still have doubts was ever more than a hack. That 2008 mailing
list message didn't involve multi-field indexing, which didn't exist
then (only introduced in 2009), and we have wanted to make them views
(not copies) since their inception.
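
As a minimal sketch of why that explicit view depends on memory layout:
the field names follow the example from the other thread, while the
data values and the hand-built padded dtype are assumptions made purely
for illustration.

```python
import numpy as np

# Packed layout: two float64 fields back to back, 16 bytes per record.
packed = np.zeros(3, dtype=[('b', 'f8'), ('c', 'f8')])
packed['b'] = [1.0, 2.0, 3.0]
packed['c'] = [10.0, 20.0, 30.0]

# The explicit view reinterprets each 16-byte record as two float64
# values. This is only meaningful because both fields are 'f8' and
# there are no padding bytes between or after them.
col_means = packed.view(('f8', 2)).mean(0)

# The same two fields, but with 8 padding bytes after 'b' (offsets and
# itemsize chosen by hand here to imitate a padded record).
padded_dtype = np.dtype({
    'names': ['b', 'c'],
    'formats': ['f8', 'f8'],
    'offsets': [0, 16],
    'itemsize': 24,
})
padded = np.zeros(3, dtype=padded_dtype)
padded['b'] = packed['b']
padded['c'] = packed['c']

# A record is now 24 bytes, so "two float64 values" no longer describes
# it, but access by field name is unaffected by the padding.
mean_b = padded['b'].mean()
mean_c = padded['c'].mean()
```

Single-field access gives the same means in both layouts; only the
reinterpreting view is tied to the packed, homogeneous case.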

Second, I don't think we are doing so silently: We have warned about
this in release notes since numpy 1.7 in 2012/2013, and it has been
mentioned in most releases since then. We have also raised
FutureWarnings about it since 1.7. Unfortunately we missed warning in
your specific case for a while, but we corrected this in 1.12 so you
should have seen FutureWarnings since then.

I don't feel the need to officially declare that we are dropping support
for dataframe-like use of structured arrays. It's unclear where that use
ends and other uses of structured arrays begin. I think updating the
docs to warn that pandas/dask may be a better choice is enough, as I've
been doing, and then users can decide for themselves.

There is still the question about whether we should make
numpy.lib.recfunctions more official. I don't have a strong opinion. I
suppose it would be good to add a section to the structured array docs
which lists those methods and says something like

"the submodule numpy.lib.recfunctions provides minimal functionality to
split, combine, and manipulate structured datatypes and arrays. In most
cases, we strongly recommend users use a dedicated module such as
pandas/xarray/dask instead of these methods, but they are provided for
occasional convenience."
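
For illustration, such a docs section might carry a short example along
these lines. It uses three functions that exist in
numpy.lib.recfunctions (append_fields, drop_fields, rename_fields); the
array contents and field names are made up.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# A small structured array with an integer id and one float column.
base = np.array([(1, 2.0), (2, 4.0)], dtype=[('id', 'i8'), ('x', 'f8')])

# Combine: append a new field 'y'. usemask=False returns a plain
# ndarray rather than a masked array.
with_y = rfn.append_fields(base, 'y', np.array([10.0, 20.0]), usemask=False)

# Split: drop the 'id' field, keeping the remaining fields.
without_id = rfn.drop_fields(with_y, 'id', usemask=False)

# Manipulate: rename field 'x' to 'x0'.
renamed = rfn.rename_fields(without_id, {'x': 'x0'})
```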


Allan

    Allan


             Josef


                 Allan

                 >
                 >     Cheers!
                 >     Ben Root
                 >
                 >     On Mon, Jan 29, 2018 at 3:24 PM, <[email protected]> wrote:
                 >
                 >
                 >
                 >         On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt
                 >         <[email protected]> wrote:
                 >
                 >             On Mon, 29 Jan 2018 14:10:56 -0500,
                 >             [email protected] wrote:
                 >
                 >                 Given that there is pandas, xarray, dask and more,
                 >                 numpy could as well drop any pretense of supporting
                 >                 dataframe_likes. Or, adjust the recfunctions so
                 >                 we can still work dataframe_like with structured
                 >                 dtypes/recarrays/recfunctions.
                 >
                 >
                 >             I haven't been following the duckarray discussion
                 >             carefully, but could this be an opportunity for a
                 >             dataframe protocol, so that we can have libraries
                 >             ingest structured arrays, record arrays, pandas
                 >             dataframes, etc. without too much specialized code?
                 >
                 >
                 >         AFAIU while not being in the data handling area, pandas defines
                 >         the interface and other libraries provide pandas compatible
                 >         interfaces or implementations.
                 >
                 >         statsmodels currently still has recarray support and usage. In
                 >         some interfaces we support pandas, recarrays and plain arrays,
                 >         or anything where asarray works correctly.
                 >
                 >         But recarrays became messy to support, one rewrite of some
                 >         functions last year converts recarrays to pandas, does the
                 >         manipulation and then converts back to recarrays.
                 >         Also we need to adjust our recarray usage with new numpy
                 >         versions. But there is no real benefit because I doubt that
                 >         statsmodels still has any recarray/structured dtype users. So,
                 >         we only have to remove our own uses in the datasets and unit
                 >         tests.
                 >
                 >         Josef
                 >
                 >
                 >
                 >
                 >             Stéfan
                 >

                 > _______________________________________________
                 > NumPy-Discussion mailing list
                 > [email protected]
                 > https://mail.python.org/mailman/listinfo/numpy-discussion