On Tue, Jan 30, 2018 at 7:33 PM, Allan Haldane <allanhald...@gmail.com> wrote:
> On 01/30/2018 04:54 PM, josef.p...@gmail.com wrote:
> >
> > On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhald...@gmail.com> wrote:
> >
> >     On 01/30/2018 01:33 PM, josef.p...@gmail.com wrote:
> >     > AFAICS, one problem is that the padded view didn't come with the
> >     > matching down stream usage support, the pack function as mentioned, an
> >     > alternative way to convert to a standard ndarray, copy doesn't get rid
> >     > of the padding and so on.
> >     >
> >     > eg. another mailing list thread I just found with the same problem
> >     > http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
> >     >
> >     > quoting Ralf:
> >     > Question: is that really the recommended way to get an (N, 2) size float
> >     > array from two columns of a larger record array? If so, why isn't there
> >     > a better way? If you'd want to write to that (N, 2) array you have to
> >     > append a copy, making it even uglier. Also, then there really should be
> >     > tests for views in test_records.py.
> >     >
> >     > This "better way" never showed up, AFAIK. And it looks like we came back
> >     > to this problem every few years.
> >     >
> >     > Josef
> >
> >     Since we are at least pushing off this change to a later release
> >     (1.15?), we have some time to prepare/catch up.
> >
> >     What can we add to numpy.lib.recfunctions to make the multi-field
> >     copy->view change smoother? We have discussed at least two functions:
> >
> >     * repack_fields - rearrange the memory layout of a structured array to
> >       add/remove padding between fields
> >
> >     * structured_to_unstructured - turns a n-D structured array into an
> >       (n+1)-D unstructured ndarray, whose dtype is the highest common type of
> >       all the fields. May want the inverse function too.
> >
> > The only sticky point with statsmodels is to have an equivalent of
> > a[['b', 'c']].view(('f8', 2)).
> >
> > Highest common dtype might be object, the main usecase for this is to
> > select some elements of a specific dtype and then use them as
> > standard, homogeneous ndarray. In our case and other cases that I have
> > seen it is mainly to select a subset of the floating point numbers.
> > Another case of this might be to combine two strings into one
> > a[['b', 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I used
> > this in serious code.
>
> I implemented and put up a draft of these functions in
> https://github.com/numpy/numpy/pull/10411

Comments based on reading the last commit

> I think they satisfy all your cases: code like
>
>     >>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
>     >>> a[['b', 'c']].view(('f8', 2))
>
> becomes:
>
>     >>> import numpy.lib.recfunctions as rf
>     >>> rf.structured_to_unstructured(a[['b', 'c']])
>     array([[1., 1.],
>            [1., 1.],
>            [1., 1.]])
>
> The highest common dtype is usually not "Object", since I use
> `np.result_type` to determine the output type. So two fields of 'S5' and
> 'S3' result in an 'S5' array.

structured_to_unstructured looks good to me

> > for inverse function: I guess it is still possible to view any standard
> > homogenous ndarray with a structured dtype as long as the itemsize matches.
>
> The inverse is implemented too. And it even supports varied field
> dtypes, nested fields, and subarrays, as you can see in the docstring
> examples.
>
> > Browsing through old mailing list threads, I saw that adding multiple
> > fields or concatenating two arrays with structured dtypes into an array
> > with a single combined dtype was missing and I guess still is. (IIRC
> > this is the usecase where we go now the pandas detour in statsmodels.)
> >
> >     We might also consider
> >
> >     * apply_along_fields(arr, method) - applies the method along the
> >       "field" axis, equivalent to something like
> >       method(struct_to_unstructured(arr), axis=-1)
> >
> > If this works on a padded view of an existing array, then this would be
> > an improvement over the current version of having to extract and copy
> > the relevant fields of an existing structured dtype or loop over
> > different numeric dtypes, ints, floats.
> >
> > In general there will need to be a way to apply `method` only to
> > selected columns, or columns of a matching dtype. (e.g. We don't want
> > the sum or mean of a string.)
> > (e.g. we use ptp() on numeric fields to check if there is already a
> > constant column in the array or dataframe)
>
> Means over selected columns are accounted for using multi-field
> indexing. For example:
>
>     >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
>     ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
>
>     >>> rf.apply_along_fields(np.mean, b)
>     array([ 2.66666667, 5.33333333, 8.66666667, 11. ])
>
>     >>> rf.apply_along_fields(np.mean, b[['x', 'z']])
>     array([ 3. , 5.5, 9. , 11. ])

Actually, I would have expected apply_along_columns, i.e. reduce over all
observations for each field. This might need an axis argument.

However, in the current form it is less practical than doing it ourselves
with structured_to_unstructured, because it makes a copy of all elements
each time. E.g.

    rf.apply_along_fields(np.mean, b[['x', 'z']])
    rf.apply_along_fields(np.std, b[['x', 'z']])

would do the same structured_to_unstructured copy of all array elements
twice.

Josef

> This is unaffected by the 1.14 to 1.15 changes.
>
> Allan
>
> >     I think these are pretty minimal and shouldn't be too hard to implement.
> >
> > AFAICS, it would cover the statsmodels usage.
> >
> > Josef
> >
> >     Allan
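
To make the "doing it ourselves" alternative from the apply_along_fields comment concrete, here is a minimal sketch. It assumes a NumPy release in which structured_to_unstructured is available in numpy.lib.recfunctions; the draft in the PR may differ in details. The variable names (xz, row_mean, col_mean, ...) are only illustrative.

```python
import numpy as np
import numpy.lib.recfunctions as rf

# Same example array as in the quoted message above.
b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# Convert the selected fields to a plain (N, 2) ndarray once.
# Assumption: the common dtype of 'i4' and 'f8' is float64 (np.result_type).
xz = rf.structured_to_unstructured(b[['x', 'z']])

# Reductions along the field axis (one value per record), reusing the
# single conversion instead of converting once per statistic.
row_mean = xz.mean(axis=-1)
row_std = xz.std(axis=-1)

# Reductions over all observations (one value per field), i.e. the
# "apply along columns" direction; ptp == 0 would flag a constant column.
col_mean = xz.mean(axis=0)
col_ptp = np.ptp(xz, axis=0)

print(row_mean)
print(col_mean, col_ptp)
```

With this pattern the conversion happens once, and the choice of axis decides whether the reduction runs per record or per field.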
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion