[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

ASF GitHub Bot (JIRA) Mon, 16 Apr 2018 03:12:19 -0700

    [ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439224#comment-16439224 ]


ASF GitHub Bot commented on ARROW-2101:
---------------------------------------

pitrou closed pull request #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc b/cpp/src/arrow/python/numpy_to_arrow.cc
index e37013c7e..dcb96a48a 100644
--- a/cpp/src/arrow/python/numpy_to_arrow.cc
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -228,12 +228,15 @@ static Status AppendObjectBinaries(PyArrayObject* arr, PyArrayObject* mask,
 /// can fit
 ///
 /// \param[in] offset starting offset for appending
+/// \param[in] check_valid if set to true and the input array
+/// contains values that cannot be converted to unicode, returns
+/// a Status code containing a Python exception message
 /// \param[out] end_offset ending offset where we stopped appending. Will
 /// be length of arr if fully consumed
 /// \param[out] have_bytes true if we encountered any PyBytes object
 static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64_t offset,
-                                  StringBuilder* builder, int64_t* end_offset,
-                                  bool* have_bytes) {
+                                  bool check_valid, StringBuilder* builder,
+                                  int64_t* end_offset, bool* have_bytes) {
   PyObject* obj;
 
   Ndarray1DIndexer<PyObject*> objects(arr);
@@ -256,8 +259,7 @@ static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64
       *have_bytes = true;
     }
     bool is_full;
-    RETURN_NOT_OK(
-        internal::BuilderAppend(builder, obj, false /* check_valid */, &is_full));
+    RETURN_NOT_OK(internal::BuilderAppend(builder, obj, check_valid, &is_full));
     if (is_full) {
       break;
     }
@@ -844,6 +846,13 @@ Status NumPyConverter::ConvertObjectStrings() {
   StringBuilder builder(pool_);
   RETURN_NOT_OK(builder.Resize(length_));
 
+  // If the creator of this NumPyConverter specified a type,
+  // then we want to force the output type to be utf8. If
+  // the input data is PyBytes and not PyUnicode and
+  // not convertible to utf8, the call to AppendObjectStrings
+  // below will fail because we pass force_string as the
+  // value for check_valid.
+  bool force_string = type_ != nullptr && type_->Equals(utf8());
   bool global_have_bytes = false;
   if (length_ == 0) {
     // Produce an empty chunk
@@ -854,8 +863,10 @@ Status NumPyConverter::ConvertObjectStrings() {
     int64_t offset = 0;
     while (offset < length_) {
       bool chunk_have_bytes = false;
-      RETURN_NOT_OK(
-          AppendObjectStrings(arr_, mask_, offset, &builder, &offset, &chunk_have_bytes));
+      // Always set check_valid to true when force_string is true
+      RETURN_NOT_OK(AppendObjectStrings(arr_, mask_, offset,
+                                        force_string /* check_valid */, &builder, &offset,
+                                        &chunk_have_bytes));
 
       global_have_bytes = global_have_bytes | chunk_have_bytes;
       std::shared_ptr<Array> chunk;
@@ -864,8 +875,13 @@ Status NumPyConverter::ConvertObjectStrings() {
     }
   }
 
-  // If we saw PyBytes, convert everything to BinaryArray
-  if (global_have_bytes) {
+  // If we saw bytes, convert it to a binary array. If
+  // force_string was set to true, the input data could
+  // have been bytes but we've checked to make sure that
+  // it can be converted to utf-8 in the call to
+  // AppendObjectStrings. In that case, we can safely leave
+  // it as a utf8 type.
+  if (!force_string && global_have_bytes) {
     for (size_t i = 0; i < out_arrays_.size(); ++i) {
       auto binary_data = out_arrays_[i]->data()->Copy();
       binary_data->type = ::arrow::binary();
@@ -1393,8 +1409,12 @@ inline Status NumPyConverter::ConvertTypedLists<NPY_OBJECT, StringType>(
       RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT));
 
       int64_t offset = 0;
-      RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, value_builder, &offset,
-                                        &have_bytes));
+      // If a type was specified and it was utf8, then we set
+      // check_valid to true. If any of the input cannot be
+      // converted, then we will exit early here.
+      bool check_valid = type_ != nullptr && type_->Equals(::arrow::utf8());
+      RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, check_valid,
+                                        value_builder, &offset, &have_bytes));
       if (offset < PyArray_SIZE(numpy_array)) {
         return Status::Invalid("Array cell value exceeded 2GB");
       }
diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index c6e2b75be..83a6c458c 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -1188,6 +1188,21 @@ def test_table_str_to_categorical_with_na(self):
             table.to_pandas(strings_to_categorical=True,
                             zero_copy_only=True)
 
+    # Regression test for ARROW-2101
+    def test_array_of_bytes_to_strings(self):
+        converted = pa.array(np.array([b'x'], dtype=object), pa.string())
+        assert converted.type == pa.string()
+
+    # Make sure that if an ndarray of bytes is passed to the array
+    # constructor and the type is string, it will fail if those bytes
+    # cannot be converted to utf-8
+    def test_array_of_bytes_to_strings_bad_data(self):
+        with pytest.raises(
+                pa.lib.ArrowException,
+                message="Unknown error: 'utf-8' codec can't decode byte 0x80 "
+                "in position 0: invalid start byte"):
+            pa.array(np.array([b'\x80\x81'], dtype=object), pa.string())
+
 
 class TestConvertDecimalTypes(object):
     """
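
For readers who do not want to trace the C++ hunks, the decision logic that the patch appears to introduce in ConvertObjectStrings can be modelled in a few lines of pure Python. This is only a sketch under stated assumptions: the function name convert_object_strings and its return shape are hypothetical and are not part of pyarrow, and the real code works on chunked Arrow builders rather than Python lists. The force_string flag mirrors the variable of the same name in the diff, which is true only when the caller requested the utf8 type.

```python
from typing import List, Optional, Tuple, Union


def convert_object_strings(
    values: List[Optional[Union[bytes, str]]],
    force_string: bool = False,
) -> Tuple[str, list]:
    """Hypothetical pure-Python model of the patched logic (not pyarrow API).

    Returns a (type_name, converted_values) pair, where type_name is
    "string" or "binary".
    """
    have_bytes = False
    raw = []
    for value in values:
        if value is None:
            # Nulls pass through unchanged.
            raw.append(None)
        elif isinstance(value, bytes):
            have_bytes = True
            if force_string:
                # Corresponds to check_valid=true in the diff: bytes that
                # are not valid UTF-8 raise instead of being accepted.
                value.decode("utf-8")
            raw.append(value)
        else:
            raw.append(value.encode("utf-8"))

    # Before the patch, any bytes input forced a binary result. After it,
    # the binary fallback is skipped when the caller asked for a string type.
    if have_bytes and not force_string:
        return "binary", raw
    return "string", [v if v is None else v.decode("utf-8") for v in raw]
```

Under this model, convert_object_strings([b'x']) yields a binary result, while passing force_string=True yields a string result, and input such as b'\x80\x81' with force_string=True raises UnicodeDecodeError, which corresponds to the failure the second regression test expects.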


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2101
>                 URL: https://issues.apache.org/jira/browse/ARROW-2101
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>              Labels: pull-request-available
>
> Using Python 2, converting Pandas data of 'str' type to Arrow produces Arrow 
> data of binary type, even if the user supplies type information. Conversion 
> of 'unicode' type correctly produces Arrow data of string type. For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
