[jira] [Created] (ARROW-13474) [C++][Python] PyArrow crash when filter/take empty Extension array
Paul Balanca created ARROW-13474:

Summary: [C++][Python] PyArrow crash when filter/take empty Extension array
Key: ARROW-13474
URL: https://issues.apache.org/jira/browse/ARROW-13474
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 4.0.1, 4.0.0, 3.0.0
Environment: Python 3.7, Ubuntu 20.04
Reporter: Paul Balanca
Assignee: Paul Balanca

PyArrow crashes when `filter` or `take` is applied to an already empty extension array. The bug can be reproduced with the documentation example:

{code:java}
import pyarrow as pa

class Point3DArray(pa.ExtensionArray):
    def to_numpy_array(self):
        return self.storage.flatten().to_numpy().reshape((-1, 3))

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_class__(self):
        return Point3DArray

storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)
arr = arr.filter(pa.array([False, False]))
# Crashing here...
arr.filter(pa.array([], pa.bool_()))
# Crashing as well...
arr.take(pa.array([], pa.int32()))
{code}

The underlying issue seems to be that the function `nulls` is not implemented for extension types in the C++ codebase: https://github.com/apache/arrow/blob/6db88a9e946c98c59f179210a70bc05ef6a0a296/cpp/src/arrow/array/util.cc#L472

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view
Paul Balanca created ARROW-11006:

Summary: [Python] Array to_numpy slow compared to Numpy.view
Key: ARROW-11006
URL: https://issues.apache.org/jira/browse/ARROW-11006
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Paul Balanca
Assignee: Paul Balanca

The method `to_numpy` is quite slow compared to Numpy slicing and viewing. For instance:

{code:java}
import numpy as np
import pyarrow as pa

N = 100
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}

The previous benchmark is clearly an extreme case, but the idea is that for any operation not available in PyArrow, falling back on Numpy is a good option, and the cost of extracting the array should be as small as possible (there are scenarios where you cannot easily cache this view, so you end up calling `to_numpy` many times).

I believe that part of this overhead is probably due to PyArrow implementing a very generic Pandas conversion and using it even for very simple Numpy-like dense arrays.
[jira] [Created] (ARROW-10675) [C++][Python] Support AWS S3 Web identity credentials
Paul Balanca created ARROW-10675:

Summary: [C++][Python] Support AWS S3 Web identity credentials
Key: ARROW-10675
URL: https://issues.apache.org/jira/browse/ARROW-10675
Project: Apache Arrow
Issue Type: Improvement
Affects Versions: 2.0.0, 1.0.1
Reporter: Paul Balanca

It seems to me that Arrow currently supports only the "AssumeRole" AWS STS API, but not the other options offered:

[https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html#stsapi_comparison]
[https://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_auth_1_1_s_t_s_assume_role_web_identity_credentials_provider.html]

I am clearly no security/infra expert, but it seems that the "AssumeRoleWithWebIdentity" configuration is commonly used in Kubernetes setups, and I believe it would be beneficial for the Arrow C++ and Python libraries to support it.

At the moment, a workaround is to call `aws sts` directly to generate a temporary session, but it is fairly painful because the session expires: all PyArrow objects holding an S3 filesystem (datasets, ...) need to be rebuilt with new credentials.
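The workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not Arrow's own mechanism: `s3_kwargs_from_sts` and `needs_refresh` are hypothetical helpers, the JSON layout is the `Credentials` object that the `aws sts` commands print, and the credential values are placeholders. The resulting keyword arguments match the `access_key`, `secret_key` and `session_token` parameters of `pyarrow.fs.S3FileSystem`.

```python
import json
from datetime import datetime, timedelta, timezone


def s3_kwargs_from_sts(sts_json):
    """Hypothetical helper: turn `aws sts` JSON output into S3FileSystem kwargs."""
    creds = json.loads(sts_json)["Credentials"]
    # Normalise a trailing "Z" so that datetime.fromisoformat accepts it.
    expiration = datetime.fromisoformat(creds["Expiration"].replace("Z", "+00:00"))
    kwargs = {
        "access_key": creds["AccessKeyId"],
        "secret_key": creds["SecretAccessKey"],
        "session_token": creds["SessionToken"],
    }
    return kwargs, expiration


def needs_refresh(expiration, now=None, margin=timedelta(minutes=5)):
    """Return True once the session is within `margin` of expiring."""
    now = now or datetime.now(timezone.utc)
    return now >= expiration - margin


# Placeholder credentials standing in for real `aws sts` output.
sample = json.dumps({
    "Credentials": {
        "AccessKeyId": "AKIAEXAMPLE",
        "SecretAccessKey": "secret-example",
        "SessionToken": "token-example",
        "Expiration": "2020-11-24T12:00:00Z",
    }
})
kwargs, expires = s3_kwargs_from_sts(sample)

# With PyArrow installed, a filesystem could then be built with:
#   from pyarrow.fs import S3FileSystem
#   fs = S3FileSystem(**kwargs)
```

The pain point from the report remains: once `needs_refresh` returns True, every object holding the old filesystem has to be rebuilt from a fresh one.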
[jira] [Created] (ARROW-10214) [Python] UnicodeDecodeError when printing schema with binary metadata
Paul Balanca created ARROW-10214:

Summary: [Python] UnicodeDecodeError when printing schema with binary metadata
Key: ARROW-10214
URL: https://issues.apache.org/jira/browse/ARROW-10214
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.1, 1.0.0, 0.17.1, 0.17.0
Environment: Python 3.6 - 3.8
Reporter: Paul Balanca

Since PyArrow 0.17.0, the following small example raises a `UnicodeDecodeError`:

{code:java}
import pyarrow as pa

bdata = b"\xff\xff\xff\xff8\x02\x00\x00\x10\x00\x00\x00\x00\x00\n\x00\x0c\x00\x06\x00\x05\x00\x08\x00\n\x00\x00\x00\x00\x01\x04\x00\x0c\x00\x00\x00\x08\x00\x08\x00\x00\x00\x04\x00\x08\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x04\x00\x00\x00\x1a\xff\xff\xff\x00\x00\x00\x0c\xd0\x00\x00\x00\x9c\x00\x00\x00\x90\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00P\x00\x00\x00\x04\x00\x00\x00\xc0\xfe\xff\xff\x08\x00\x00\x00 \x00\x00\x00\x14\x00\x00\x00ARROW:extension:name\x00\x00\x00\x00\x1b"
t = pa.table({"data": pa.array([1, 2])}, metadata={b"k": bdata})
print(t.schema)
{code}

In our case, the binary data comes from the serialization of another PyArrow schema, but I guess the error can appear with any binary metadata in the schema. The print used to work fine with PyArrow 0.16, giving this output:

{code:java}
data: int64
metadata
OrderedDict([(b'k',
              b'\xff\xff\xff\xff8\x02\x00\x00\x10\x00\x00\x00\x00\x00\n\x00'
              b'\x0c\x00\x06\x00\x05\x00\x08\x00\n\x00\x00\x00\x00\x01\x04\x00'
              b'\x0c\x00\x00\x00\x08\x00\x08\x00\x00\x00\x04\x00'
              b'\x08\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00'
              b'\x00\x01\x00\x00\x04\x00\x00\x00\x1a\xff\xff\xff'
              b'\x00\x00\x00\x0c\xd0\x00\x00\x00\x9c\x00\x00\x00'
              b'\x90\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00P\x00\x00\x00'
              b'\x04\x00\x00\x00\xc0\xfe\xff\xff\x08\x00\x00\x00 \x00\x00\x00'
              b'\x14\x00\x00\x00ARROW:extension:name\x00\x00\x00\x00\x1b')])
{code}

Shall I work on a patch to restore the PyArrow 0.16 behaviour?
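One way a pretty-printer could tolerate binary metadata is to try a UTF-8 decode and fall back to the bytes `repr` when that fails. The sketch below only illustrates that idea with the standard library; `format_metadata_value` is a hypothetical name, and this is not a description of how PyArrow formats schemas.

```python
def format_metadata_value(value):
    """Hypothetical helper: render metadata bytes without raising."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Binary payloads (e.g. a serialized schema) are shown as a bytes literal.
        return repr(value)


text_value = b"ARROW:extension:name"
binary_value = b"\xff\xff\xff\xff8\x02\x00\x00"  # prefix of the payload above

print(format_metadata_value(text_value))
print(format_metadata_value(binary_value))
```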
[jira] [Commented] (ARROW-7605) [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a
[ https://issues.apache.org/jira/browse/ARROW-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094355#comment-17094355 ]

Paul Balanca commented on ARROW-7605:

Is there a working way to compile statically with Arrow 0.17.0? I am getting linking errors due to jemalloc.

> [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a
>
> Key: ARROW-7605
> URL: https://issues.apache.org/jira/browse/ARROW-7605
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> If ARROW_JEMALLOC=ON, then currently the libarrow.a cannot be used for static
> linking without also obtaining libjemalloc_pic.a
[jira] [Closed] (ARROW-8131) [Python] Add dynamic attributes to PyArrow ExtensionArray
[ https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Balanca closed ARROW-8131.

Resolution: Won't Do

> [Python] Add dynamic attributes to PyArrow ExtensionArray
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
> Reporter: Paul Balanca
> Priority: Major
> Labels: Python3, pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> In the present implementation, the interface of the class `ExtensionArray` is
> not extensible by users. One cannot easily inherit from it, as the
> constructor `__init__` cannot be called directly, nor does it allow
> dynamically adding attributes.
> Keeping the current design with builder methods `from_*`, I believe it would
> make sense to allow dynamic attributes in `ExtensionArray` (see
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
> The runtime and size cost for the Python objects would be fairly minimal
> compared to the increased flexibility it would allow.
> A typical use case would be dynamic mixins (added by a custom factory),
> allowing projects based on PyArrow to extend (! :)) the interface with
> specific business logic.
[jira] [Commented] (ARROW-8131) [Python] Add dynamic attributes to PyArrow ExtensionArray
[ https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060833#comment-17060833 ]

Paul Balanca commented on ARROW-8131:

I thought at first it was a design choice in Arrow to make inheritance hard (specialized builder methods {{from_*}}). My first try and intuition was in fact to go for inheritance, and your proposal of adding {{ExtensionType.__arrow_ext_class__}} sounds like a cleaner solution than trying to dynamically add properties to every instance generated.

Let me see if I can make it work the way you suggested.

> [Python] Add dynamic attributes to PyArrow ExtensionArray
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
> Reporter: Paul Balanca
> Priority: Major
> Labels: Python3, pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> In the present implementation, the interface of the class `ExtensionArray` is
> not extensible by users. One cannot easily inherit from it, as the
> constructor `__init__` cannot be called directly, nor does it allow
> dynamically adding attributes.
> Keeping the current design with builder methods `from_*`, I believe it would
> make sense to allow dynamic attributes in `ExtensionArray` (see
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
> The runtime and size cost for the Python objects would be fairly minimal
> compared to the increased flexibility it would allow.
> A typical use case would be dynamic mixins (added by a custom factory),
> allowing projects based on PyArrow to extend (! :)) the interface with
> specific business logic.
[jira] [Commented] (ARROW-8131) [Python] Add dynamic attributes to PyArrow ExtensionArray
[ https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060794#comment-17060794 ]

Paul Balanca commented on ARROW-8131:

Indeed. I am happy to work on ARROW-6176 and re-open an MR if that is the preferred way.

> [Python] Add dynamic attributes to PyArrow ExtensionArray
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
> Reporter: Paul Balanca
> Priority: Major
> Labels: Python3, pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> In the present implementation, the interface of the class `ExtensionArray` is
> not extensible by users. One cannot easily inherit from it, as the
> constructor `__init__` cannot be called directly, nor does it allow
> dynamically adding attributes.
> Keeping the current design with builder methods `from_*`, I believe it would
> make sense to allow dynamic attributes in `ExtensionArray` (see
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
> The runtime and size cost for the Python objects would be fairly minimal
> compared to the increased flexibility it would allow.
> A typical use case would be dynamic mixins (added by a custom factory),
> allowing projects based on PyArrow to extend (! :)) the interface with
> specific business logic.
[jira] [Updated] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray
[ https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Balanca updated ARROW-8131:

Labels: Python3 (was: pull-request-available)

> Add dynamic attributes to PyArrow ExtensionArray
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
> Reporter: Paul Balanca
> Priority: Major
> Labels: Python3
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In the present implementation, the interface of the class `ExtensionArray` is
> not extensible by users. One cannot easily inherit from it, as the
> constructor `__init__` cannot be called directly, nor does it allow
> dynamically adding attributes.
> Keeping the current design with builder methods `from_*`, I believe it would
> make sense to allow dynamic attributes in `ExtensionArray` (see
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
> The runtime and size cost for the Python objects would be fairly minimal
> compared to the increased flexibility it would allow.
> A typical use case would be dynamic mixins (added by a custom factory),
> allowing projects based on PyArrow to extend (! :)) the interface with
> specific business logic.
[jira] [Created] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray
Paul Balanca created ARROW-8131:

Summary: Add dynamic attributes to PyArrow ExtensionArray
Key: ARROW-8131
URL: https://issues.apache.org/jira/browse/ARROW-8131
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.16.0
Environment: Ubuntu 19.10 + Python 3.7
Reporter: Paul Balanca

In the present implementation, the interface of the class `ExtensionArray` is not extensible by users. One cannot easily inherit from it, as the constructor `__init__` cannot be called directly, nor does it allow dynamically adding attributes.

Keeping the current design with builder methods `from_*`, I believe it would make sense to allow dynamic attributes in `ExtensionArray` (see [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]). The runtime and size cost for the Python objects would be fairly minimal compared to the increased flexibility it would allow.

A typical use case would be dynamic mixins (added by a custom factory), allowing projects based on PyArrow to extend (! :)) the interface with specific business logic.
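The limitation described above can be illustrated in pure Python with `__slots__`, which, loosely like a Cython extension type without a `__dict__`, rejects attributes that were not declared up front. This is only an analogy under that assumption: `FixedLayoutArray` and `DynamicArray` are made-up classes, not PyArrow ones.

```python
class FixedLayoutArray:
    """Stand-in for an extension type whose instances have no __dict__."""

    __slots__ = ("storage",)

    def __init__(self, storage):
        self.storage = storage


class DynamicArray(FixedLayoutArray):
    """Subclass without __slots__: its instances gain a __dict__."""


fixed = FixedLayoutArray([1, 2, 3])
try:
    fixed.units = "metres"  # not a declared slot, so this raises
    fixed_accepts = True
except AttributeError:
    fixed_accepts = False

dynamic = DynamicArray([1, 2, 3])
dynamic.units = "metres"  # stored in the instance __dict__

print(fixed_accepts, dynamic.units)
```

The per-instance `__dict__` is where the runtime and size cost mentioned in the issue would come from.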
[jira] [Commented] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
[ https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060071#comment-17060071 ]

Paul Balanca commented on ARROW-7365:

If I may continue the discussion point raised in ARROW-8010: I believe there is a use case for FixedSizeList arrays to be convertible to two-dimensional Numpy arrays (or even multi-dimensional ones). There are many applications where one wants to store small vectors or matrices with known static dimensions (e.g. a 3D vector or a 3D affine transform). The fixed-size Arrow column format is ideal for that purpose, and it allows writing high-performance code on this kind of storage.

But to make it possible to write such high-performance pipelines based on PyArrow, one needs to be able to extract the full 2D Numpy array from the PyArrow object. Technically, it is possible, as shown by the small example in ARROW-8010, but it would probably be valuable for it to be part of the official API.

Is `to_numpy` the right place to implement it? I am not sure; I probably don't have enough depth of view on this project to have a good opinion. But I believe there are numerous pure Numpy computation pipelines based on PyArrow in-memory storage which would benefit from a "closer to metal" Numpy API, independent of the Pandas-like series representation.

> [Python] Support FixedSizeList type in conversion to numpy/pandas
>
> Key: ARROW-7365
> URL: https://issues.apache.org/jira/browse/ARROW-7365
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Fix For: 0.17.0
>
> Follow-up on ARROW-7261, still need to add support for FixedSizeListType in
> the arrow -> python conversion (arrow_to_pandas.cc)
[jira] [Commented] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array / pandas Series
[ https://issues.apache.org/jira/browse/ARROW-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060064#comment-17060064 ]

Paul Balanca commented on ARROW-8010:

Thanks for the quick answer. Sorry, I did not notice at first that an issue for this already existed.

> [Python] Fixed size list not convertible to Numpy Array / pandas Series
>
> Key: ARROW-8010
> URL: https://issues.apache.org/jira/browse/ARROW-8010
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + python 3.7
> Reporter: Paul Balanca
> Priority: Major
>
> Fixed size lists of base types (e.g. int, float, ...) are not convertible to
> Numpy arrays.
> The following code:
> {code:java}
> import pyarrow as pa
> t = pa.list_(pa.float32(), 2)
> arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
> arr.to_numpy()
> {code}
> raises an Arrow not-implemented error, as there is no Pandas block equivalent.
> It sounds reasonable that the conversion to Pandas fails, but I would expect
> a natural conversion to a Numpy array since, according to the Fixed Size List
> Layout ([https://arrow.apache.org/docs/format/Columnar.html#]), such an array
> could be mapped to a 2-dimensional row-major matrix (e.g. 3x2 in the previous
> example).
> Note that we can get the expected result with a workaround using flatten:
> {code:java}
> arr.flatten().to_numpy().reshape((-1, t.list_size))
> {code}
> This form of memory representation is quite natural if one wants to use
> Apache Arrow for in-memory collections of 2D/3D points, where we wish to have
> coordinates contiguous in memory.
[jira] [Updated] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array
[ https://issues.apache.org/jira/browse/ARROW-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Balanca updated ARROW-8010:

Description:

Fixed size lists of base types (e.g. int, float, ...) are not convertible to Numpy arrays. The following code:

{code:java}
import pyarrow as pa
t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy()
{code}

raises an Arrow not-implemented error, as there is no Pandas block equivalent. It sounds reasonable that the conversion to Pandas fails, but I would expect a natural conversion to a Numpy array since, according to the Fixed Size List Layout ([https://arrow.apache.org/docs/format/Columnar.html#]), such an array could be mapped to a 2-dimensional row-major matrix (e.g. 3x2 in the previous example).

Note that we can get the expected result with a workaround using flatten:

{code:java}
arr.flatten().to_numpy().reshape((-1, t.list_size))
{code}

This form of memory representation is quite natural if one wants to use Apache Arrow for in-memory collections of 2D/3D points, where we wish to have coordinates contiguous in memory.

was:

Fixed size list of base types (i.e. int, float, ...) are not convertible to Numpy array. The following code:

{code:java}
import pyarrow as pa
t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy()
{code}

raises a not implemented Arrow error as there is no Pandas block equivalent. It sounds reasonable that the conversion to Pandas fails, but I would expect a natural conversion to Numpy Array, as according to the Fixed Size List Layout ([https://arrow.apache.org/docs/format/Columnar.html#]), the former could be mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous example).

This form of memory representation is quite natural if ones wants to use Apache Arrow for in-memory collection of 2D/3D points, where we wish to have coordinates contiguous in memory.

> [Python] Fixed size list not convertible to Numpy Array
>
> Key: ARROW-8010
> URL: https://issues.apache.org/jira/browse/ARROW-8010
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + python 3.7
> Reporter: Paul Balanca
> Priority: Major
[jira] [Updated] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array
[ https://issues.apache.org/jira/browse/ARROW-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Balanca updated ARROW-8010:

Description:

Fixed size lists of base types (e.g. int, float, ...) are not convertible to Numpy arrays. The following code:

{code:java}
import pyarrow as pa
t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy()
{code}

raises an Arrow not-implemented error, as there is no Pandas block equivalent. It sounds reasonable that the conversion to Pandas fails, but I would expect a natural conversion to a Numpy array since, according to the Fixed Size List Layout ([https://arrow.apache.org/docs/format/Columnar.html#]), such an array could be mapped to a 2-dimensional row-major matrix (e.g. 3x2 in the previous example).

This form of memory representation is quite natural if one wants to use Apache Arrow for in-memory collections of 2D/3D points, where we wish to have coordinates contiguous in memory.

was:

Fixed size list of base types (i.e. int, float, ...) are not convertible to Numpy array. The following code:

{code:java}
t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy()
{code}

raises a not implemented Arrow error as there is no Pandas block equivalent. It sounds reasonable that the conversion to Pandas fails, but I would expect a natural conversion to Numpy Array, as according to the Fixed Size List Layout ([https://arrow.apache.org/docs/format/Columnar.html#]), the former could be mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous example).

This form of memory representation is quite natural if ones wants to use Apache Arrow for in-memory collection of 2D/3D points, where we wish to have coordinates contiguous in memory.

> [Python] Fixed size list not convertible to Numpy Array
>
> Key: ARROW-8010
> URL: https://issues.apache.org/jira/browse/ARROW-8010
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + python 3.7
> Reporter: Paul Balanca
> Priority: Major
[jira] [Created] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array
Paul Balanca created ARROW-8010:

Summary: [Python] Fixed size list not convertible to Numpy Array
Key: ARROW-8010
URL: https://issues.apache.org/jira/browse/ARROW-8010
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.16.0
Environment: Ubuntu 19.10 + python 3.7
Reporter: Paul Balanca

Fixed size lists of base types (e.g. int, float, ...) are not convertible to Numpy arrays. The following code:

{code:java}
t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy()
{code}

raises an Arrow not-implemented error, as there is no Pandas block equivalent. It sounds reasonable that the conversion to Pandas fails, but I would expect a natural conversion to a Numpy array since, according to the Fixed Size List Layout ([https://arrow.apache.org/docs/format/Columnar.html#]), such an array could be mapped to a 2-dimensional row-major matrix (e.g. 3x2 in the previous example).

This form of memory representation is quite natural if one wants to use Apache Arrow for in-memory collections of 2D/3D points, where we wish to have coordinates contiguous in memory.