[jira] [Commented] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447725#comment-16447725 ]

ASF GitHub Bot commented on ARROW-564:
--

pitrou commented on a change in pull request #1931: ARROW-564 [Python] Add support for return zero copy NumPy arrays
URL: https://github.com/apache/arrow/pull/1931#discussion_r183307671

 ## File path: python/pyarrow/tests/test_array.py ##
 @@ -83,6 +83,33 @@ def test_long_array_format():
      assert result == expected
 +def test_to_numpy_zero_copy():
 +    import gc
 +
 +    arr = pa.array(range(10))
 +
 +    for i in range(10):

 Review comment:
 Why do you loop 10 times?

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array
> if there are nulls)
>
> Key: ARROW-564
> URL: https://issues.apache.org/jira/browse/ARROW-564
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
> At the moment, for {{pyarrow.Array}} instances, we have a method called
> {{to_pandas}}. While this method returns NumPy Arrays, it returns them in the
> form that Pandas would use them in its {{Series}}. The difference here is
> visible for example in the case of integers with null values. For Pandas, we
> convert it into a float array and set all entries to NaN where we have null
> entries in the Arrow array. For vanilla NumPy arrays, we would return a tuple
> of a valid bytemap (not bitmap!) and a values array. The values array in this
> case should simply be a view on the underlying Arrow buffer.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
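The difference the issue description draws between the Pandas-style result and the proposed (bytemap, values) pair can be sketched with NumPy alone. This is only an illustration under stated assumptions: the arrays below are hand-built stand-ins for an Arrow int64 array `[1, 2, null, 4]`, and the helper names `bitmap_to_bytemap` and `pandas_style` are hypothetical, not pyarrow API.

```python
import numpy as np

# Hand-built stand-ins (assumptions, not pyarrow objects) for [1, 2, null, 4].
values = np.array([1, 2, 0, 4], dtype=np.int64)           # the slot under the null is arbitrary
validity_bitmap = np.array([0b00001011], dtype=np.uint8)  # bits 0, 1 and 3 set, least-significant bit first


def bitmap_to_bytemap(bitmap, length):
    """Expand a packed, LSB-first validity bitmap into one bool per value."""
    bits = np.unpackbits(bitmap, bitorder="little")
    return bits[:length].astype(bool)


def pandas_style(values, valid):
    """Pandas-like result: cast to float64 and put NaN where the value is null."""
    out = values.astype(np.float64)
    out[~valid] = np.nan
    return out


valid = bitmap_to_bytemap(validity_bitmap, len(values))
as_float = pandas_style(values, valid)   # float64 copy, integer dtype is lost
numpy_style = (valid, values)            # bytemap plus untouched int64 values
```

The second form keeps the original integer dtype and leaves the values array untouched, which is what would allow it to be a plain view on the underlying buffer.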
[jira] [Commented] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447729#comment-16447729 ]

ASF GitHub Bot commented on ARROW-564:
--

pitrou commented on a change in pull request #1931: ARROW-564 [Python] Add support for return zero copy NumPy arrays
URL: https://github.com/apache/arrow/pull/1931#discussion_r183307323

 ## File path: python/pyarrow/tests/test_array.py ##
 @@ -83,6 +83,33 @@ def test_long_array_format():
      assert result == expected
 +def test_to_numpy_zero_copy():
 +    import gc
 +
 +    arr = pa.array(range(10))
 +
 +    for i in range(10):
 +        np_arr = arr.to_numpy()
 +        assert sys.getrefcount(np_arr) == 2

 Review comment:
 I'm not sure what this line is meant to test?
[jira] [Commented] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447728#comment-16447728 ]

ASF GitHub Bot commented on ARROW-564:
--

pitrou commented on a change in pull request #1931: ARROW-564 [Python] Add support for return zero copy NumPy arrays
URL: https://github.com/apache/arrow/pull/1931#discussion_r183308083

 ## File path: python/pyarrow/tests/test_array.py ##
 @@ -83,6 +83,33 @@ def test_long_array_format():
      assert result == expected
 +def test_to_numpy_zero_copy():
 +    import gc
 +
 +    arr = pa.array(range(10))
 +
 +    for i in range(10):
 +        np_arr = arr.to_numpy()
 +        assert sys.getrefcount(np_arr) == 2
 +        np_arr = None  # noqa
 +
 +    assert sys.getrefcount(arr) == 2
 +
 +    for i in range(10):
 +        arr = pa.array(range(10))
 +        np_arr = arr.to_numpy()
 +        arr = None
 +        gc.collect()
 +
 +        # Ensure base is still valid

 Review comment:
 I'm not sure that's the right way of looking at it. Just check that `np_arr.base` is not None...
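The check suggested in the review comment (that `np_arr.base` is not None once the original object is gone) can be illustrated without pyarrow. In this minimal sketch a plain NumPy array stands in for the Arrow-backed memory, which is an assumption made so the snippet runs on its own; `make_view` is a hypothetical helper.

```python
import gc

import numpy as np


def make_view():
    """Return a view whose owning array is reachable only through ``.base``."""
    owner = np.arange(10, dtype=np.int64)  # stand-in for the Arrow-backed memory
    return owner[2:5]                      # basic slicing returns a view, not a copy


view = make_view()
gc.collect()  # the owner has no name left; only view.base keeps it alive

# The view still references its owner, and the data is still readable.
assert view.base is not None
assert view.tolist() == [2, 3, 4]
assert view.sum() == 9
```

Asserting on the summed value, rather than only calling `sum()`, also confirms that the memory behind the view still holds the expected contents.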
[jira] [Commented] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447731#comment-16447731 ]

ASF GitHub Bot commented on ARROW-564:
--

pitrou commented on a change in pull request #1931: ARROW-564 [Python] Add support for return zero copy NumPy arrays
URL: https://github.com/apache/arrow/pull/1931#discussion_r183309840

 ## File path: python/pyarrow/tests/test_array.py ##
 @@ -83,6 +83,33 @@ def test_long_array_format():
      assert result == expected
 +def test_to_numpy_zero_copy():

 Review comment:
 This function isn't actually testing the zero-copy part. You should mutate the result Numpy array and check the original Arrow array is mutated (of course, the fact we're able to get a mutable Numpy array from an Arrow array could be seen as a bug).
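The mutation test proposed in the review comment can be sketched as follows. Since the exact pyarrow behaviour is what the pull request is still deciding, this sketch uses a `bytearray` as a stand-in for the Arrow buffer (an assumption) and only shows the general technique: write through the candidate array and see whether the backing memory changes.

```python
import numpy as np

# Stand-in for an Arrow data buffer (assumption): raw bytes of four int64 values.
backing = bytearray(np.arange(4, dtype=np.int64).tobytes())

view = np.frombuffer(backing, dtype=np.int64)  # zero-copy: shares memory with `backing`
copy = np.array(view)                          # explicit copy: independent memory

view[0] = 42   # writes through to `backing`
copy[1] = 99   # leaves `backing` untouched

# Re-read the backing memory to see which write reached it.
reread = np.frombuffer(backing, dtype=np.int64)

# A read-only view would avoid the "mutable result" concern raised in the review.
readonly = view.view()
readonly.flags.writeable = False
try:
    readonly[0] = 7
    write_rejected = False
except ValueError:
    write_rejected = True
```

Here `reread` reflects the write made through `view` but not the one made through `copy`, which is the property a zero-copy test would assert on.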
[jira] [Commented] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447727#comment-16447727 ]

ASF GitHub Bot commented on ARROW-564:
--

pitrou commented on a change in pull request #1931: ARROW-564 [Python] Add support for return zero copy NumPy arrays
URL: https://github.com/apache/arrow/pull/1931#discussion_r183308880

 ## File path: python/pyarrow/tests/test_array.py ##
 @@ -577,6 +604,31 @@ def test_simple_type_construction():
      str(result)
 +@pytest.mark.parametrize(
 +    'narr',
 +    [
 +        np.arange(10, dtype=np.int64),
 +        np.arange(10, dtype=np.int32),
 +        np.arange(10, dtype=np.int16),
 +        np.arange(10, dtype=np.int8),
 +        np.arange(10, dtype=np.uint64),
 +        np.arange(10, dtype=np.uint32),
 +        np.arange(10, dtype=np.uint16),
 +        np.arange(10, dtype=np.uint8),
 +        np.arange(10, dtype=np.float64),
 +        np.arange(10, dtype=np.float32),
 +        np.arange(10, dtype=np.float16),
 +    ]
 +)
 +def test_to_numpy_roundtrip(narr):
 +    arr = pa.array(narr)
 +    assert narr.dtype == arr.to_numpy().dtype
 +    assert np.array_equal(narr, arr.to_numpy())

 Review comment:
 Use `np.testing.assert_array_equal`.
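The practical difference behind the suggestion to use `np.testing.assert_array_equal` can be seen in a small NumPy-only sketch. The arrays here are made up for illustration; the point is that a bare `assert np.array_equal(...)` only yields a boolean, while the testing helper raises an `AssertionError` whose message describes the mismatch.

```python
import numpy as np

expected = np.arange(10, dtype=np.int32)
actual = expected.copy()
actual[3] = -1  # introduce a single mismatching element

# Plain comparison: collapses everything into one boolean.
plain_result = np.array_equal(actual, expected)

# Testing helper: raises, and the exception text reports the differences.
try:
    np.testing.assert_array_equal(actual, expected)
except AssertionError as exc:
    message = str(exc)
else:
    message = ""

# Note: assert_array_equal compares values, not dtypes, so this passes.
np.testing.assert_array_equal(np.array([1, 2], dtype=np.int32),
                              np.array([1.0, 2.0], dtype=np.float64))
```

Because the helper compares values only, the separate `narr.dtype == arr.to_numpy().dtype` assertion in the test above would still be needed for a dtype round-trip check.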
[jira] [Commented] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447730#comment-16447730 ]

ASF GitHub Bot commented on ARROW-564:
--

pitrou commented on a change in pull request #1931: ARROW-564 [Python] Add support for return zero copy NumPy arrays
URL: https://github.com/apache/arrow/pull/1931#discussion_r183308153

 ## File path: python/pyarrow/tests/test_array.py ##
 @@ -83,6 +83,33 @@ def test_long_array_format():
      assert result == expected
 +def test_to_numpy_zero_copy():
 +    import gc
 +
 +    arr = pa.array(range(10))
 +
 +    for i in range(10):
 +        np_arr = arr.to_numpy()
 +        assert sys.getrefcount(np_arr) == 2
 +        np_arr = None  # noqa
 +
 +    assert sys.getrefcount(arr) == 2
 +
 +    for i in range(10):
 +        arr = pa.array(range(10))
 +        np_arr = arr.to_numpy()
 +        arr = None
 +        gc.collect()
 +
 +        # Ensure base is still valid
 +
 +        # Because of py.test's assert inspection magic, if you put getrefcount
 +        # on the line being examined, it will be 1 higher than you expect
 +        base_refcount = sys.getrefcount(np_arr.base)
 +        assert base_refcount == 2
 +        np_arr.sum()

 Review comment:
 You should check the result value.
[jira] [Commented] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447726#comment-16447726 ]

ASF GitHub Bot commented on ARROW-564:
--

pitrou commented on a change in pull request #1931: ARROW-564 [Python] Add support for return zero copy NumPy arrays
URL: https://github.com/apache/arrow/pull/1931#discussion_r183307516

 ## File path: python/pyarrow/tests/test_array.py ##
 @@ -83,6 +83,33 @@ def test_long_array_format():
      assert result == expected
 +def test_to_numpy_zero_copy():
 +    import gc
 +
 +    arr = pa.array(range(10))
 +
 +    for i in range(10):
 +        np_arr = arr.to_numpy()
 +        assert sys.getrefcount(np_arr) == 2
 +        np_arr = None  # noqa
 +
 +    assert sys.getrefcount(arr) == 2

 Review comment:
 Instead of hardcoding this, you should check the original value hasn't changed:
 ```
 old_refcount = sys.getrefcount(arr)
 # ... do something
 assert sys.getrefcount(arr) == old_refcount
 ```
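The before/after comparison recommended in the review comment can be wrapped in a small helper. This is a sketch that assumes CPython, where `sys.getrefcount` is meaningful; `Resource` and `refcount_unchanged` are hypothetical names used only for illustration.

```python
import sys


class Resource:
    """Hypothetical stand-in for the object whose references are tracked."""


def refcount_unchanged(obj, action):
    """Run ``action(obj)`` and report whether obj's refcount is the same afterwards."""
    before = sys.getrefcount(obj)
    action(obj)
    return sys.getrefcount(obj) == before


leaked = []
arr = Resource()

# Temporary references created inside the action are released when it returns.
ok = refcount_unchanged(arr, lambda o: len([o, o]))

# Appending to a long-lived list keeps an extra reference alive.
bad = refcount_unchanged(arr, leaked.append)
```

Comparing against the earlier count avoids depending on the absolute number, which is an interpreter implementation detail and, as the code comment in the diff notes, can be shifted by one under py.test's assert rewriting.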
[jira] [Commented] (ARROW-564) [Python] Add methods to return vanilla NumPy arrays (plus boolean mask array if there are nulls)
[ https://issues.apache.org/jira/browse/ARROW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447212#comment-16447212 ]

ASF GitHub Bot commented on ARROW-564:
--

kynan opened a new pull request #1931: ARROW-564 [Python] Add support for return zero copy NumPy arrays
URL: https://github.com/apache/arrow/pull/1931

 Depends on the in-flight pull request for ARROW-2491