[jira] [Created] (ARROW-13474) [C++][Python] PyArrow crash when filter/take empty Extension array

2021-07-28 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-13474:


 Summary: [C++][Python] PyArrow crash when filter/take empty 
Extension array
 Key: ARROW-13474
 URL: https://issues.apache.org/jira/browse/ARROW-13474
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 4.0.1, 4.0.0, 3.0.0
 Environment: Python 3.7, Ubuntu 20.04
Reporter: Paul Balanca
Assignee: Paul Balanca


PyArrow crashes when `filter` or `take` is applied to an already empty 
extension array.

The bug can be reproduced with the documentation example:
{code:java}
import pyarrow as pa

class Point3DArray(pa.ExtensionArray):
    def to_numpy_array(self):
        return self.storage.flatten().to_numpy().reshape((-1, 3))


class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_class__(self):
        return Point3DArray

storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)
arr = arr.filter(pa.array([False, False]))
# Crashing here...
arr.filter(pa.array([], pa.bool_()))
# Crashing as well...
arr.take(pa.array([], pa.int32())){code}
The underlying issue seems to be that the function `nulls` is not implemented 
for extension types in the C++ codebase: 
https://github.com/apache/arrow/blob/6db88a9e946c98c59f179210a70bc05ef6a0a296/cpp/src/arrow/array/util.cc#L472
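
Until this is fixed, one possible stop-gap is to skip the kernel call when the 
array is already empty. The sketch below is duck-typed and does not import 
PyArrow; `safe_filter` and `safe_take` are hypothetical helper names, and it 
assumes the crash only occurs for inputs that are already empty, as in the 
reproduction above.

```python
def safe_filter(arr, mask):
    """Filter `arr`, skipping the kernel call when `arr` is already empty."""
    if len(arr) == 0:
        # Filtering an empty array can only give an empty array back.
        return arr
    return arr.filter(mask)


def safe_take(arr, indices):
    """Take from `arr`, skipping the kernel call when nothing is requested
    from an already empty array."""
    if len(arr) == 0 and len(indices) == 0:
        # Nothing to take from and nothing requested.
        return arr
    # Non-empty indices on an empty array still go to the kernel, so that
    # out-of-bounds requests are reported rather than silently ignored.
    return arr.take(indices)
```

With the arrays from the report, `safe_filter(arr, pa.array([], pa.bool_()))` 
would return `arr` itself instead of calling into the C++ kernel.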



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view

2020-12-22 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-11006:


 Summary: [Python] Array to_numpy slow compared to Numpy.view
 Key: ARROW-11006
 URL: https://issues.apache.org/jira/browse/ARROW-11006
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Paul Balanca
Assignee: Paul Balanca


The method `to_numpy` is quite slow compared to Numpy slicing and viewing 
performance. For instance:
{code:java}
import numpy as np
import pyarrow as pa

N = 100
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The previous benchmark is clearly an extreme case, but the idea is that for any 
operation not available in PyArrow, falling back on Numpy is a good option, and 
the cost of extracting the array should be as small as possible (there are 
scenarios where you cannot easily cache this view, so you end up calling 
`to_numpy` a fair number of times).

I believe that part of this overhead is probably due to PyArrow implementing a 
very generic Pandas conversion, and using it even for very simple Numpy-like 
dense arrays.
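
To compare the two calls on a per-call basis, independent of the loop length, 
something like the following could be used. It is a minimal sketch relying only 
on the standard library `timeit` module; `per_call_seconds` is a hypothetical 
helper, and the callables passed to it (for example `np_arr.view` or 
`lambda: pa_arr.to_numpy(zero_copy_only=True)`) are left to the caller.

```python
import timeit


def per_call_seconds(fn, number=10_000, repeat=5):
    """Return the best observed wall-clock time of a single `fn()` call."""
    # timeit.repeat returns the total time of `number` calls, once per repeat.
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    # The minimum is the least noisy estimate; divide to get a per-call figure.
    return min(totals) / number
```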





[jira] [Created] (ARROW-10675) [C++][Python] Support AWS S3 Web identity credentials

2020-11-21 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-10675:


 Summary: [C++][Python] Support AWS S3 Web identity credentials
 Key: ARROW-10675
 URL: https://issues.apache.org/jira/browse/ARROW-10675
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 2.0.0, 1.0.1
Reporter: Paul Balanca


It seems to me that Arrow currently supports only the "AssumeRole" AWS STS 
API, but not the other options offered: 
[https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html#stsapi_comparison]

[https://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_auth_1_1_s_t_s_assume_role_web_identity_credentials_provider.html]

 

I am clearly no security/infra expert, but it seems that the 
"AssumeRoleWithWebIdentity" configuration is commonly used in Kubernetes 
setups, and I believe it would be beneficial for the Arrow C++ & Python 
libraries to support it.

At the moment, a workaround is to call `aws sts` directly to generate a 
temporary session, but it is fairly painful as the session expires: all 
PyArrow objects with an S3 filesystem (datasets, ...) need to be rebuilt with 
new credentials. 
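
The expiry handling in that workaround can be sketched as a small cache that 
renews credentials shortly before they run out. This is only an illustration 
under stated assumptions: `RefreshingSession` and its `fetch` argument are 
hypothetical names, `fetch` stands for whatever wraps the `aws sts` call, and 
no Arrow or AWS API is used here.

```python
import time


class RefreshingSession:
    """Cache temporary credentials and renew them shortly before they expire."""

    def __init__(self, fetch, margin_s=60.0, clock=time.time):
        # `fetch` is a caller-supplied function returning
        # (credentials, expiry_epoch_seconds), e.g. a wrapper around `aws sts`.
        self._fetch = fetch
        self._margin_s = margin_s
        self._clock = clock
        self._creds = None
        self._expiry = 0.0

    def credentials(self):
        """Return valid credentials, fetching new ones when close to expiry."""
        if self._creds is None or self._clock() >= self._expiry - self._margin_s:
            self._creds, self._expiry = self._fetch()
        return self._creds
```

Any object built from the old credentials would still have to be rebuilt by the 
caller whenever `credentials()` returns a new value, which is the painful part 
described above.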





[jira] [Created] (ARROW-10214) [Python] UnicodeDecodeError when printing schema with binary metadata

2020-10-07 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-10214:


 Summary: [Python] UnicodeDecodeError when printing schema with 
binary metadata
 Key: ARROW-10214
 URL: https://issues.apache.org/jira/browse/ARROW-10214
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 1.0.1, 1.0.0, 0.17.1, 0.17.0
 Environment: Python 3.6 - 3.8
Reporter: Paul Balanca


The following small example raises a `UnicodeDecodeError` since PyArrow 
version 0.17.0:
{code:java}
import pyarrow as pa

bdata = (
    b"\xff\xff\xff\xff8\x02\x00\x00\x10\x00\x00\x00\x00\x00\n\x00\x0c\x00\x06\x00\x05\x00\x08\x00\n\x00\x00\x00\x00\x01\x04\x00\x0c\x00\x00\x00\x08\x00\x08\x00\x00\x00\x04\x00\x08\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x04\x00\x00\x00\x1a\xff\xff\xff\x00\x00\x00\x0c\xd0\x00\x00\x00\x9c\x00\x00\x00\x90\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00P\x00\x00\x00\x04\x00\x00\x00\xc0\xfe\xff\xff\x08\x00\x00\x00"
    b" \x00\x00\x00\x14\x00\x00\x00ARROW:extension:name\x00\x00\x00\x00\x1b"
)

t = pa.table({"data": pa.array([1, 2])}, metadata={b"k": bdata})
print(t.schema){code}
In our case, the binary data is coming from the serialization of another 
PyArrow schema. But I guess the error can appear with any binary metadata in 
the schema.

The print used to work fine with PyArrow 0.16, getting this output:
{code:java}
data: int64
metadata

OrderedDict([(b'k',
  b'\xff\xff\xff\xff8\x02\x00\x00\x10\x00\x00\x00\x00\x00\n\x00'
  b'\x0c\x00\x06\x00\x05\x00\x08\x00\n\x00\x00\x00\x00\x01\x04\x00'
  b'\x0c\x00\x00\x00\x08\x00\x08\x00\x00\x00\x04\x00'
  b'\x08\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00'
  b'\x00\x01\x00\x00\x04\x00\x00\x00\x1a\xff\xff\xff'
  b'\x00\x00\x00\x0c\xd0\x00\x00\x00\x9c\x00\x00\x00'
  b'\x90\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00P\x00\x00\x00'
  b'\x04\x00\x00\x00\xc0\xfe\xff\xff\x08\x00\x00\x00 \x00\x00\x00'
  b'\x14\x00\x00\x00ARROW:extension:name\x00\x00\x00\x00\x1b')])
{code}
Shall I work on a patch to restore the PyArrow 0.16 behaviour?
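
One plausible reading of the error, stated here as an assumption rather than a 
confirmed diagnosis, is that the metadata value gets decoded as UTF-8 when the 
schema is printed. The standard-library sketch below shows why strict decoding 
fails on the first bytes of the value above, and two ways of displaying 
arbitrary bytes without raising. It does not reproduce PyArrow's printing code.

```python
raw = b"\xff\xff\xff\xff8\x02\x00\x00"  # first bytes of the metadata value

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    # 0xff can never appear in valid UTF-8, so strict decoding fails.
    strict_ok = False

# Two ways to display arbitrary bytes without raising:
as_repr = repr(raw)  # Python bytes literal, as in the 0.16 output
as_escaped = raw.decode("utf-8", errors="backslashreplace")
```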





[jira] [Commented] (ARROW-7605) [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a

2020-04-28 Thread Paul Balanca (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094355#comment-17094355
 ] 

Paul Balanca commented on ARROW-7605:
-

Is there a working way to compile statically with Arrow 0.17.0? I am getting 
these linking errors due to jemalloc.

> [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a
> ---
>
> Key: ARROW-7605
> URL: https://issues.apache.org/jira/browse/ARROW-7605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> If ARROW_JEMALLOC=ON, then currently the libarrow.a cannot be used for static 
> linking without also obtaining libjemalloc_pic.a





[jira] [Closed] (ARROW-8131) [Python] Add dynamic attributes to PyArrow ExtensionArray

2020-03-24 Thread Paul Balanca (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Balanca closed ARROW-8131.
---
Resolution: Won't Do

> [Python] Add dynamic attributes to PyArrow ExtensionArray
> -
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
>Reporter: Paul Balanca
>Priority: Major
>  Labels: Python3, pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In the present implementation, the interface of the class `ExtensionArray` is 
> not extensible by users. One cannot easily inherit from it, as the 
> constructor `__init__` cannot be called directly, and the class does not 
> allow attributes to be added dynamically.
> Keeping the current design with builder methods `from_*`, I believe it could 
> then make sense to allow dynamic attributes in `ExtensionArray` (see 
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
>  The runtime & size cost of the Python objects would be fairly minimal 
> compared to the increased flexibility it would allow.
> A typical use case where it could be useful would be dynamic mixins (added by 
> a custom factory), allowing projects based on PyArrow to extend (! :)) the 
> interface with specific business logic. 





[jira] [Commented] (ARROW-8131) [Python] Add dynamic attributes to PyArrow ExtensionArray

2020-03-17 Thread Paul Balanca (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060833#comment-17060833
 ] 

Paul Balanca commented on ARROW-8131:
-

I thought at first it was a design choice in Arrow to make inheritance hard 
(specialized builder methods {{from_*}}).

My first try & intuition was in fact to go for inheritance, and your 
proposal of adding {{ExtensionType.__arrow_ext_class__}} sounds like a 
cleaner solution than trying to dynamically add properties to every instance 
generated. Let me see if I can make it work the way you suggested.
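
The idea of letting the type name its own array class can be pictured in pure 
Python as below. This is a sketch of the pattern only: `BaseArray`, `BaseType`, 
`wrap`, `PointArray` and `PointType` are hypothetical stand-ins, not PyArrow 
classes, and the real hook may be wired differently.

```python
class BaseArray:
    """Stand-in for a generic array wrapper (not the PyArrow class)."""

    def __init__(self, storage):
        self.storage = storage


class BaseType:
    """Stand-in for an extension type that names its own array class."""

    def __arrow_ext_class__(self):
        # Default: wrap storage in the generic array class.
        return BaseArray

    def wrap(self, storage):
        # The factory asks the type which class to instantiate.
        return self.__arrow_ext_class__()(storage)


class PointArray(BaseArray):
    def norms_squared(self):
        # Business logic lives on the subclass, not on each instance.
        return [x * x + y * y for x, y in self.storage]


class PointType(BaseType):
    def __arrow_ext_class__(self):
        return PointArray
```

With this shape, `PointType().wrap(storage)` yields a `PointArray`, so custom 
methods are available without adding attributes to every instance.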

> [Python] Add dynamic attributes to PyArrow ExtensionArray
> -
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
>Reporter: Paul Balanca
>Priority: Major
>  Labels: Python3, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h





[jira] [Commented] (ARROW-8131) [Python] Add dynamic attributes to PyArrow ExtensionArray

2020-03-17 Thread Paul Balanca (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060794#comment-17060794
 ] 

Paul Balanca commented on ARROW-8131:
-

Indeed. I am happy to work on ARROW-6176 and re-open an MR if it's the 
preferred way.

> [Python] Add dynamic attributes to PyArrow ExtensionArray
> -
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
>Reporter: Paul Balanca
>Priority: Major
>  Labels: Python3, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h





[jira] [Updated] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray

2020-03-16 Thread Paul Balanca (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Balanca updated ARROW-8131:

Labels: Python3  (was: pull-request-available)

> Add dynamic attributes to PyArrow ExtensionArray
> 
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
>Reporter: Paul Balanca
>Priority: Major
>  Labels: Python3
>  Time Spent: 20m
>  Remaining Estimate: 0h





[jira] [Created] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray

2020-03-16 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-8131:
---

 Summary: Add dynamic attributes to PyArrow ExtensionArray
 Key: ARROW-8131
 URL: https://issues.apache.org/jira/browse/ARROW-8131
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.16.0
 Environment: Ubuntu 19.10 + Python 3.7
Reporter: Paul Balanca


In the present implementation, the interface of the class `ExtensionArray` is 
not extensible by users. One cannot easily inherit from it, as the constructor 
`__init__` cannot be called directly, and the class does not allow attributes 
to be added dynamically.

Keeping the current design with builder methods `from_*`, I believe it could 
then make sense to allow dynamic attributes in `ExtensionArray` (see 
[https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
 The runtime & size cost of the Python objects would be fairly minimal 
compared to the increased flexibility it would allow.

A typical use case where it could be useful would be dynamic mixins (added by 
a custom factory), allowing projects based on PyArrow to extend (! :)) the 
interface with specific business logic. 
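
The restriction can be pictured in pure Python with `__slots__`. This is an 
analogy only: per the Cython page linked above, extension types reject new 
attributes unless a `__dict__` is declared, and a slotted Python class behaves 
in a similar way. `Sealed`, `Open` and `try_set` are hypothetical names.

```python
class Sealed:
    """No per-instance __dict__, so only the declared slot can be set."""

    __slots__ = ("storage",)

    def __init__(self, storage):
        self.storage = storage


class Open(Sealed):
    """A subclass without __slots__ regains a __dict__ and accepts new attributes."""


def try_set(obj, name, value):
    """Return True if the attribute could be set, False on AttributeError."""
    try:
        setattr(obj, name, value)
        return True
    except AttributeError:
        return False
```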





[jira] [Commented] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2020-03-16 Thread Paul Balanca (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060071#comment-17060071
 ] 

Paul Balanca commented on ARROW-7365:
-

If I may continue the discussion point raised in ARROW-8010.

I believe there is a use case for FixedSizeList arrays to be convertible to 
two-dimensional Numpy arrays (or even multi-dimensional ones). There exist many 
applications where one wants to store small vectors/matrices with known static 
dimensions (e.g. 3D vectors, 3D affine transforms). The fixed-size Arrow column 
format is ideal for that kind of purpose, and it then allows writing 
high-performance code on this kind of storage.

But in order to write this kind of high-performance pipeline based on 
PyArrow, one needs to be able to extract the full 2D Numpy array from the 
PyArrow object. Technically, it is possible, as shown by the small example in 
ARROW-8010, but it would probably be valuable for it to be part of the 
official API.

Is `to_numpy` the right place to implement it? I am not sure; I probably 
don't have enough perspective on this project to have a good opinion. But I 
believe there are numerous pure Numpy computation pipelines based on PyArrow 
in-memory storage which would benefit from a "closer to the metal" Numpy API, 
independent of the Pandas-like series representation.

> [Python] Support FixedSizeList type in conversion to numpy/pandas
> -
>
> Key: ARROW-7365
> URL: https://issues.apache.org/jira/browse/ARROW-7365
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 0.17.0
>
>
> Follow-up on ARROW-7261, still need to add support for FixedSizeListType in 
> the arrow -> python conversion (arrow_to_pandas.cc)





[jira] [Commented] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array / pandas Series

2020-03-16 Thread Paul Balanca (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060064#comment-17060064
 ] 

Paul Balanca commented on ARROW-8010:
-

Thanks for the quick answer. Sorry, I did not notice at first that it already 
existed.

> [Python] Fixed size list not convertible to Numpy Array / pandas Series
> ---
>
> Key: ARROW-8010
> URL: https://issues.apache.org/jira/browse/ARROW-8010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + python 3.7
>Reporter: Paul Balanca
>Priority: Major
>
> Fixed-size lists of base types (e.g. int, float, ...) are not convertible to 
> Numpy arrays.
> The following code:
> {code:java}
> import pyarrow as pa
> t = pa.list_(pa.float32(), 2)
> arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
> arr.to_numpy(){code}
> raises a not implemented Arrow error as there is no Pandas block equivalent.
> It sounds reasonable that the conversion to Pandas fails, but I would expect 
> a natural conversion to Numpy Array, as according to the Fixed Size List 
> Layout ([https://arrow.apache.org/docs/format/Columnar.html#]), the former 
> could be mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous 
> example).
> Note that the expected result can be obtained with a workaround using flatten:
> {code:java}
> arr.flatten().to_numpy().reshape((-1, t.list_size)){code}
> This form of memory representation is quite natural if one wants to use 
> Apache Arrow for in-memory collections of 2D/3D points, where we wish to have 
> coordinates contiguous in memory.





[jira] [Updated] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array

2020-03-05 Thread Paul Balanca (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Balanca updated ARROW-8010:

Description: 
Fixed-size lists of base types (e.g. int, float, ...) are not convertible to 
Numpy arrays.

The following code:
{code:java}
import pyarrow as pa

t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy(){code}
raises a not implemented Arrow error as there is no Pandas block equivalent.

It sounds reasonable that the conversion to Pandas fails, but I would expect a 
natural conversion to Numpy Array, as according to the Fixed Size List Layout 
([https://arrow.apache.org/docs/format/Columnar.html#]), the former could be 
mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous example).

Note that the expected result can be obtained with a workaround using flatten:
{code:java}
arr.flatten().to_numpy().reshape((-1, t.list_size)){code}
This form of memory representation is quite natural if one wants to use Apache 
Arrow for in-memory collections of 2D/3D points, where we wish to have 
coordinates contiguous in memory.

  was:
Fixed-size lists of base types (e.g. int, float, ...) are not convertible to 
Numpy arrays.

The following code:
{code:java}
import pyarrow as pa

t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy(){code}
raises a not implemented Arrow error as there is no Pandas block equivalent.

It sounds reasonable that the conversion to Pandas fails, but I would expect a 
natural conversion to Numpy Array, as according to the Fixed Size List Layout 
([https://arrow.apache.org/docs/format/Columnar.html#]), the former could be 
mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous example).

This form of memory representation is quite natural if one wants to use Apache 
Arrow for in-memory collections of 2D/3D points, where we wish to have 
coordinates contiguous in memory.


> [Python] Fixed size list not convertible to Numpy Array
> ---
>
> Key: ARROW-8010
> URL: https://issues.apache.org/jira/browse/ARROW-8010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + python 3.7
>Reporter: Paul Balanca
>Priority: Major





[jira] [Updated] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array

2020-03-05 Thread Paul Balanca (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Balanca updated ARROW-8010:

Description: 
Fixed-size lists of base types (e.g. int, float, ...) are not convertible to 
Numpy arrays.

The following code:
{code:java}
import pyarrow as pa

t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy(){code}
raises a not implemented Arrow error as there is no Pandas block equivalent.

It sounds reasonable that the conversion to Pandas fails, but I would expect a 
natural conversion to Numpy Array, as according to the Fixed Size List Layout 
([https://arrow.apache.org/docs/format/Columnar.html#]), the former could be 
mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous example).

This form of memory representation is quite natural if one wants to use Apache 
Arrow for in-memory collections of 2D/3D points, where we wish to have 
coordinates contiguous in memory.

  was:
Fixed-size lists of base types (e.g. int, float, ...) are not convertible to 
Numpy arrays.

The following code:
{code:java}
t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy(){code}
raises a not implemented Arrow error as there is no Pandas block equivalent.

It sounds reasonable that the conversion to Pandas fails, but I would expect a 
natural conversion to Numpy Array, as according to the Fixed Size List Layout 
([https://arrow.apache.org/docs/format/Columnar.html#]), the former could be 
mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous example).

This form of memory representation is quite natural if one wants to use Apache 
Arrow for in-memory collections of 2D/3D points, where we wish to have 
coordinates contiguous in memory.


> [Python] Fixed size list not convertible to Numpy Array
> ---
>
> Key: ARROW-8010
> URL: https://issues.apache.org/jira/browse/ARROW-8010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + python 3.7
>Reporter: Paul Balanca
>Priority: Major





[jira] [Created] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array

2020-03-05 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-8010:
---

 Summary: [Python] Fixed size list not convertible to Numpy Array
 Key: ARROW-8010
 URL: https://issues.apache.org/jira/browse/ARROW-8010
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.16.0
 Environment: Ubuntu 19.10 + python 3.7
Reporter: Paul Balanca


Fixed-size lists of base types (e.g. int, float, ...) are not convertible to 
Numpy arrays.

The following code:
{code:java}
t = pa.list_(pa.float32(), 2)
arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
arr.to_numpy(){code}
raises a not implemented Arrow error as there is no Pandas block equivalent.

It sounds reasonable that the conversion to Pandas fails, but I would expect a 
natural conversion to Numpy Array, as according to the Fixed Size List Layout 
([https://arrow.apache.org/docs/format/Columnar.html#]), the former could be 
mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous example).

This form of memory representation is quite natural if one wants to use Apache 
Arrow for in-memory collections of 2D/3D points, where we wish to have 
coordinates contiguous in memory.
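
The mapping from a fixed-size list to a row-major matrix can be illustrated 
with the standard library alone. The sketch below assumes a flat values buffer 
with no nulls and uses `array` and `memoryview` as stand-ins for the Arrow 
child buffer; it does not call PyArrow or Numpy.

```python
from array import array

list_size = 2
# Flat child values of a fixed-size list column with 3 rows of width 2.
values = array("f", [1, 2, 3, 4, 5, 6])

rows = len(values) // list_size
# Reinterpret the same buffer as a rows x list_size row-major matrix (no copy).
matrix = memoryview(values).cast("B").cast("f", shape=[rows, list_size])

as_lists = matrix.tolist()
```

Because the values of each list are already contiguous, the 2D view needs only 
a change of shape, not a rearrangement of the data.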


