[jira] [Commented] (ARROW-2040) [Python] pyarrow.read_serialized returns bogus data

2018-02-28 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380331#comment-16380331
 ] 

Antoine Pitrou commented on ARROW-2040:
---

The existence of the "base" parameter in various public serialization APIs is a 
bit weird. Normally pyarrow would infer the base object (i.e. the object owning 
the memory buffer) for deserialized Numpy arrays by itself... Is there a use 
case for it? [~xhochy]
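
For readers unfamiliar with the "base object" idea mentioned above, here is a minimal standard-library analogy (not pyarrow or NumPy code). It assumes `memoryview` as a stand-in for the NumPy array and `bytearray` as a stand-in for the buffer owner; `memoryview.obj` plays roughly the role that `ndarray.base` plays in NumPy.

```python
# Stdlib analogy only: memoryview stands in for the NumPy array,
# bytearray stands in for the object that owns the memory (the "base").
owner = bytearray(b"\x00\x01\x02\x03")
view = memoryview(owner)

# The view records which object owns its memory, much like ndarray.base.
base_is_owner = view.obj is owner

# While the view is alive, the owner refuses to resize, so it cannot
# invalidate the memory the view points at.
try:
    owner.extend(b"\x04")
    resize_blocked = False
except BufferError:
    resize_blocked = True

# Once the view is released, the owner can be resized again.
view.release()
owner.extend(b"\x04")
final_length = len(owner)
```

The point of the analogy is only that a view must know, and keep alive, whatever owns its memory; how pyarrow infers that owner is not shown here.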

> [Python] pyarrow.read_serialized returns bogus data
> ---
>
>              Key: ARROW-2040
>              URL: https://issues.apache.org/jira/browse/ARROW-2040
>          Project: Apache Arrow
>       Issue Type: Bug
> Affects Versions: 0.8.0
>         Reporter: Richard Shin
>         Assignee: Antoine Pitrou
>         Priority: Major
>          Fix For: 0.9.0
>
>
> pyarrow.deserialize works fine, however.
> {code:python}
> Python 2.7.12 (default, Nov 20 2017, 18:23:56)
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa, numpy as np
> >>> with open('test.pyarrow', 'w') as f:
> ...     f.write(pa.serialize(np.arange(10, dtype=np.int32)).to_buffer().to_pybytes())
> ...
> >>> pa.read_serialized(pa.OSFile('test.pyarrow')).deserialize()
> array([54846320, 0, 45484448, 0, 4, 5, 6, 7, 8, 9], dtype=int32)
> >>> pa.deserialize(pa.frombuffer(open('test.pyarrow').read()))
> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
> {code}
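
A regression check for a report of this shape can compare two independent read paths against the values that were written. The sketch below is a standard-library analogy, not pyarrow code: `array.array` and `mmap` are assumed stand-ins for the serialized NumPy array and for the file-backed reader, and the temporary file is hypothetical. Note that it opens the file in binary mode; the `'w'` mode in the session above works on Python 2 on Linux, but writing bytes to a text-mode file raises `TypeError` on Python 3.

```python
import array
import mmap
import os
import tempfile

values = array.array("i", range(10))

# Write the raw bytes to a temporary file in binary mode.
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(values.tobytes())

# Path 1: read the whole file into a bytes object, then decode.
with open(path, "rb") as f:
    via_read = array.array("i")
    via_read.frombytes(f.read())

# Path 2: memory-map the file and copy the data out before the map closes,
# so the result never points at memory that has gone away.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        via_mmap = array.array("i")
        via_mmap.frombytes(mm[:])

os.remove(path)

# Both paths must reproduce exactly what was written.
round_trip_ok = via_read == values and via_mmap == values
```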



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2040) [Python] pyarrow.read_serialized returns bogus data

2018-02-28 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380371#comment-16380371
 ] 

Uwe L. Korn commented on ARROW-2040:


Sorry, I have no idea about that interface; [~robertnishihara] or 
[~pcmoritz] probably know better.



[jira] [Commented] (ARROW-2040) [Python] pyarrow.read_serialized returns bogus data

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380377#comment-16380377
 ] 

ASF GitHub Bot commented on ARROW-2040:
---

pitrou opened a new pull request #1680: ARROW-2040: [Python] Deserialized Numpy 
array must keep ref to underlying tensor
URL: https://github.com/apache/arrow/pull/1680
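
The PR title can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not the pyarrow implementation: `Tensor`, `ArrayView`, `tensor_to_array`, `deserialize` and the `keep_tensor_alive` flag are all hypothetical names, and a dead weak reference stands in for the dangling pointer (in C++ the symptom is garbage data rather than an exception).

```python
import gc
import weakref


class Tensor:
    """Hypothetical stand-in for an Arrow tensor: it owns the payload."""

    def __init__(self, values):
        self.values = list(values)


class ArrayView:
    """Hypothetical stand-in for a NumPy array built on borrowed memory."""

    def __init__(self, tensor, base):
        # Borrowed pointer: a weak reference does not keep the tensor alive.
        self._borrowed = weakref.ref(tensor)
        # Strong reference to the memory owner, like ndarray.base.
        self.base = base

    def tolist(self):
        tensor = self._borrowed()
        if tensor is None:
            raise ReferenceError("backing tensor was freed")
        return list(tensor.values)


def tensor_to_array(tensor, base=None, keep_tensor_alive=True):
    # Modelled on the fix: with no caller-supplied base, the tensor itself
    # becomes the base, so the view keeps its memory owner alive.
    if base is None and keep_tensor_alive:
        base = tensor
    return ArrayView(tensor, base)


def deserialize(keep_tensor_alive):
    # The only other reference to the tensor lives in this frame.
    tensor = Tensor(range(10))
    return tensor_to_array(tensor, keep_tensor_alive=keep_tensor_alive)


fixed = deserialize(keep_tensor_alive=True)
broken = deserialize(keep_tensor_alive=False)
gc.collect()  # make sure the unreferenced tensor is gone on any interpreter

fixed_values = fixed.tolist()
try:
    broken.tolist()
    broken_failed = False
except ReferenceError:
    broken_failed = True
```

With `keep_tensor_alive=False` the view outlives its tensor, which is the situation the title says must not happen.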
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (ARROW-2040) [Python] pyarrow.read_serialized returns bogus data

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380459#comment-16380459
 ] 

ASF GitHub Bot commented on ARROW-2040:
---

wesm commented on issue #1680: ARROW-2040: [Python] Deserialized Numpy array 
must keep ref to underlying tensor
URL: https://github.com/apache/arrow/pull/1680#issuecomment-369272242
 
 
   Appveyor build here: https://ci.appveyor.com/project/pitrou/arrow/build/1.0.146
   Will merge once that moves along a little more.




[jira] [Commented] (ARROW-2040) [Python] pyarrow.read_serialized returns bogus data

2018-02-28 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380465#comment-16380465
 ] 

Wes McKinney commented on ARROW-2040:
---

[~pitrou] Which APIs are you referring to in particular? The base parameter may 
not be needed; I could take a closer look.



[jira] [Commented] (ARROW-2040) [Python] pyarrow.read_serialized returns bogus data

2018-02-28 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380539#comment-16380539
 ] 

Antoine Pitrou commented on ARROW-2040:
---

I'm thinking about {{read_serialized()}} and {{deserialize_from()}}. 



[jira] [Commented] (ARROW-2040) [Python] pyarrow.read_serialized returns bogus data

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381200#comment-16381200
 ] 

ASF GitHub Bot commented on ARROW-2040:
---

cpcloud commented on issue #1680: ARROW-2040: [Python] Deserialized Numpy array 
must keep ref to underlying tensor
URL: https://github.com/apache/arrow/pull/1680#issuecomment-369417728
 
 
   Merging.




[jira] [Commented] (ARROW-2040) [Python] pyarrow.read_serialized returns bogus data

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381202#comment-16381202
 ] 

ASF GitHub Bot commented on ARROW-2040:
---

cpcloud closed pull request #1680: ARROW-2040: [Python] Deserialized Numpy 
array must keep ref to underlying tensor
URL: https://github.com/apache/arrow/pull/1680
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/arrow_to_python.cc b/cpp/src/arrow/python/arrow_to_python.cc
index 54a71d5a3..5515d24bd 100644
--- a/cpp/src/arrow/python/arrow_to_python.cc
+++ b/cpp/src/arrow/python/arrow_to_python.cc
@@ -94,7 +94,7 @@ Status DeserializeDict(PyObject* context, const Array& array, int64_t start_idx,
 Status DeserializeArray(const Array& array, int64_t offset, PyObject* base,
                         const SerializedPyObject& blobs, PyObject** out) {
   int32_t index = static_cast<const Int32Array&>(array).Value(offset);
-  RETURN_NOT_OK(py::TensorToNdarray(*blobs.tensors[index], base, out));
+  RETURN_NOT_OK(py::TensorToNdarray(blobs.tensors[index], base, out));
   // Mark the array as immutable
   OwnedRef flags(PyObject_GetAttrString(*out, "flags"));
   DCHECK(flags.obj() != NULL) << "Could not mark Numpy array immutable";
diff --git a/cpp/src/arrow/python/numpy_convert.cc b/cpp/src/arrow/python/numpy_convert.cc
index 7ba13877d..0cd616aec 100644
--- a/cpp/src/arrow/python/numpy_convert.cc
+++ b/cpp/src/arrow/python/numpy_convert.cc
@@ -30,6 +30,7 @@
 #include "arrow/type.h"
 
 #include "arrow/python/common.h"
+#include "arrow/python/pyarrow.h"
 #include "arrow/python/type_traits.h"
 
 namespace arrow {
@@ -251,50 +252,54 @@ Status NdarrayToTensor(MemoryPool* pool, PyObject* ao, std::shared_ptr<Tensor>*
   return Status::OK();
 }
 
-Status TensorToNdarray(const Tensor& tensor, PyObject* base, PyObject** out) {
+Status TensorToNdarray(const std::shared_ptr<Tensor>& tensor, PyObject* base,
+                       PyObject** out) {
   PyAcquireGIL lock;
 
   int type_num;
-  RETURN_NOT_OK(GetNumPyType(*tensor.type(), &type_num));
+  RETURN_NOT_OK(GetNumPyType(*tensor->type(), &type_num));
   PyArray_Descr* dtype = PyArray_DescrNewFromType(type_num);
   RETURN_IF_PYERROR();
 
-  std::vector<npy_intp> npy_shape(tensor.ndim());
-  std::vector<npy_intp> npy_strides(tensor.ndim());
+  const int ndim = tensor->ndim();
+  std::vector<npy_intp> npy_shape(ndim);
+  std::vector<npy_intp> npy_strides(ndim);
 
-  for (int i = 0; i < tensor.ndim(); ++i) {
-    npy_shape[i] = tensor.shape()[i];
-    npy_strides[i] = tensor.strides()[i];
+  for (int i = 0; i < ndim; ++i) {
+    npy_shape[i] = tensor->shape()[i];
+    npy_strides[i] = tensor->strides()[i];
   }
 
   const void* immutable_data = nullptr;
-  if (tensor.data()) {
-    immutable_data = tensor.data()->data();
+  if (tensor->data()) {
+    immutable_data = tensor->data()->data();
   }
 
   // Remove const =(
   void* mutable_data = const_cast<void*>(immutable_data);
 
   int array_flags = 0;
-  if (tensor.is_row_major()) {
+  if (tensor->is_row_major()) {
     array_flags |= NPY_ARRAY_C_CONTIGUOUS;
   }
-  if (tensor.is_column_major()) {
+  if (tensor->is_column_major()) {
     array_flags |= NPY_ARRAY_F_CONTIGUOUS;
   }
-  if (tensor.is_mutable()) {
+  if (tensor->is_mutable()) {
     array_flags |= NPY_ARRAY_WRITEABLE;
   }
 
   PyObject* result =
-      PyArray_NewFromDescr(&PyArray_Type, dtype, tensor.ndim(), npy_shape.data(),
+      PyArray_NewFromDescr(&PyArray_Type, dtype, ndim, npy_shape.data(),
                            npy_strides.data(), mutable_data, array_flags, nullptr);
   RETURN_IF_PYERROR()
 
-  if (base != Py_None) {
-    PyArray_SetBaseObject(reinterpret_cast<PyArrayObject*>(result), base);
+  if (base == Py_None || base == nullptr) {
+    base = py::wrap_tensor(tensor);
+  } else {
     Py_XINCREF(base);
   }
+  PyArray_SetBaseObject(reinterpret_cast<PyArrayObject*>(result), base);
   *out = result;
   return Status::OK();
 }
diff --git a/cpp/src/arrow/python/numpy_convert.h b/cpp/src/arrow/python/numpy_convert.h
index 220e38f2e..dfdb1acd1 100644
--- a/cpp/src/arrow/python/numpy_convert.h
+++ b/cpp/src/arrow/python/numpy_convert.h
@@ -65,7 +65,8 @@ Status GetNumPyType(const DataType& type, int* type_num);
 ARROW_EXPORT Status NdarrayToTensor(MemoryPool* pool, PyObject* ao,
                                     std::shared_ptr<Tensor>* out);
 
-ARROW_EXPORT Status TensorToNdarray(const Tensor& tensor, PyObject* base, PyObject** out);
+ARROW_EXPORT Status TensorToNdarray(const std::shared_ptr<Tensor>& tensor, PyObject* base,
+                                    PyObject** out);
 
 }  // namespace py
 }  // namespace arrow
diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index a43bfb93b..5b8621f13 100644
--- a/python/pyarrow/array.pxi
+++ b/pytho