[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440981#comment-16440981
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

xhochy commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-382027194
 
 
   @joshuastorck Gave you the necessary karma and assigned the issue to you, 
too.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> 
>
> Key: ARROW-2101
> URL: https://issues.apache.org/jira/browse/ARROW-2101
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Bryan Cutler
>Assignee: Joshua Storck
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Using Python 2, converting a Pandas Series of 'str' data to Arrow produces Arrow 
> data of binary type, even if the user supplies type information.  Converting 
> 'unicode' data does produce Arrow data of string type.  For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439914#comment-16439914
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381720123
 
 
   I'm not able to do it either, but I think @xhochy is :-)




> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> 
>
> Key: ARROW-2101
> URL: https://issues.apache.org/jira/browse/ARROW-2101
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Using Python 2, converting a Pandas Series of 'str' data to Arrow produces Arrow 
> data of binary type, even if the user supplies type information.  Converting 
> 'unicode' data does produce Arrow data of string type.  For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}





[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439912#comment-16439912
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381719872
 
 
   It looks like you need to be given rights to have issues assigned, and I 
guess I'm not able to do that.  @pitrou or @xhochy , would you mind doing this?




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439856#comment-16439856
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381708268
 
 
   @BryanCutler, my JIRA username is joshuastorck




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439743#comment-16439743
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381682065
 
 
   @joshuastorck , what is your JIRA username so I can assign the issue to you?




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439735#comment-16439735
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381681329
 
 
   Thanks for the clarification of Python 2 behaviour @xhochy , and thanks for 
the fix @joshuastorck ! 




> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> 
>
> Key: ARROW-2101
> URL: https://issues.apache.org/jira/browse/ARROW-2101
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Using Python 2, converting a Pandas Series of 'str' data to Arrow produces Arrow 
> data of binary type, even if the user supplies type information.  Converting 
> 'unicode' data does produce Arrow data of string type.  For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}





[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439227#comment-16439227
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381550267
 
 
   Thank you @joshuastorck !




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439224#comment-16439224
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou closed pull request #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc b/cpp/src/arrow/python/numpy_to_arrow.cc
index e37013c7e..dcb96a48a 100644
--- a/cpp/src/arrow/python/numpy_to_arrow.cc
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -228,12 +228,15 @@ static Status AppendObjectBinaries(PyArrayObject* arr, PyArrayObject* mask,
 /// can fit
 ///
 /// \param[in] offset starting offset for appending
+/// \param[in] check_valid if set to true and the input array
+/// contains values that cannot be converted to unicode, returns
+/// a Status code containing a Python exception message
 /// \param[out] end_offset ending offset where we stopped appending. Will
 /// be length of arr if fully consumed
 /// \param[out] have_bytes true if we encountered any PyBytes object
 static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64_t offset,
-                                  StringBuilder* builder, int64_t* end_offset,
-                                  bool* have_bytes) {
+                                  bool check_valid, StringBuilder* builder,
+                                  int64_t* end_offset, bool* have_bytes) {
   PyObject* obj;
 
   Ndarray1DIndexer<PyObject*> objects(arr);
@@ -256,8 +259,7 @@ static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64
       *have_bytes = true;
     }
     bool is_full;
-    RETURN_NOT_OK(
-        internal::BuilderAppend(builder, obj, false /* check_valid */, &is_full));
+    RETURN_NOT_OK(internal::BuilderAppend(builder, obj, check_valid, &is_full));
     if (is_full) {
       break;
     }
@@ -844,6 +846,13 @@ Status NumPyConverter::ConvertObjectStrings() {
   StringBuilder builder(pool_);
   RETURN_NOT_OK(builder.Resize(length_));
 
+  // If the creator of this NumPyConverter specified a type,
+  // then we want to force the output type to be utf8. If
+  // the input data is PyBytes and not PyUnicode and
+  // not convertible to utf8, the call to AppendObjectStrings
+  // below will fail because we pass force_string as the
+  // value for check_valid.
+  bool force_string = type_ != nullptr && type_->Equals(utf8());
   bool global_have_bytes = false;
   if (length_ == 0) {
     // Produce an empty chunk
@@ -854,8 +863,10 @@ Status NumPyConverter::ConvertObjectStrings() {
     int64_t offset = 0;
     while (offset < length_) {
       bool chunk_have_bytes = false;
-      RETURN_NOT_OK(
-          AppendObjectStrings(arr_, mask_, offset, &builder, &offset, &chunk_have_bytes));
+      // Always set check_valid to true when force_string is true
+      RETURN_NOT_OK(AppendObjectStrings(arr_, mask_, offset,
+                                        force_string /* check_valid */, &builder, &offset,
+                                        &chunk_have_bytes));
 
       global_have_bytes = global_have_bytes | chunk_have_bytes;
       std::shared_ptr<Array> chunk;
@@ -864,8 +875,13 @@ Status NumPyConverter::ConvertObjectStrings() {
     }
   }
 
-  // If we saw PyBytes, convert everything to BinaryArray
-  if (global_have_bytes) {
+  // If we saw bytes, convert it to a binary array. If
+  // force_string was set to true, the input data could
+  // have been bytes but we've checked to make sure that
+  // it can be converted to utf-8 in the call to
+  // AppendObjectStrings. In that case, we can safely leave
+  // it as a utf8 type.
+  if (!force_string && global_have_bytes) {
     for (size_t i = 0; i < out_arrays_.size(); ++i) {
       auto binary_data = out_arrays_[i]->data()->Copy();
       binary_data->type = ::arrow::binary();
@@ -1393,8 +1409,12 @@ inline Status NumPyConverter::ConvertTypedLists(
       RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT));
 
       int64_t offset = 0;
-      RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, value_builder, &offset,
-                                        &have_bytes));
+      // If a type was specified and it was utf8, then we set
+      // check_valid to true. If any of the input cannot be
+      // converted, then we will exit early here.
+      bool check_valid = type_ != nullptr && type_->Equals(::arrow::utf8());
+      RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, check_valid,
+                                        value_builder, &offset, &have_bytes));
   if (offset < 

[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438794#comment-16438794
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381426651
 
 
   I'm not saying we should necessarily make it faster, just wanted to make 
sure people are aware of the inefficiency.




> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> 
>
> Key: ARROW-2101
> URL: https://issues.apache.org/jira/browse/ARROW-2101
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using Python 2, converting a Pandas Series of 'str' data to Arrow produces Arrow 
> data of binary type, even if the user supplies type information.  Converting 
> 'unicode' data does produce Arrow data of string type.  For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}





[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438721#comment-16438721
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381410530
 
 
   @pitrou, on second look it won't be more efficient to move the check to 
outside of AppendObjectStrings. When passing check_valid to 
AppendObjectStrings, the UTF-8 decoding/check only happens if the data is 
Python 3 bytes or Python 2 strings. However, if the user passes Python 3 
strings or Python 2 unicode and wants a string type, no extra checks are done. 
In the case where the user wants the output type to be an arrow string, then we 
need to do the check on each bytes object. Otherwise, we will return a 
StringArray that has data that's not actually UTF-8.
   
   Please let me know if that makes sense, and if not, let me know how you 
would make it faster. 
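
The rule described in this comment can be modelled in plain Python. This is a rough sketch under stated assumptions, not the Arrow implementation: `convert_object_strings` is a hypothetical helper, type names are returned as plain strings, and the code is written for Python 3, where `bytes` plays the role of the Python 2 `str`. Only bytes objects are validated, and only when a string type was requested.

```python
def convert_object_strings(values, force_string=False):
    """Hypothetical model of the conversion; returns (type_name, encoded_values).

    force_string=True stands in for the user passing type=pa.string().
    """
    out = []
    have_bytes = False
    for v in values:
        if v is None:
            out.append(None)  # nulls are passed through untouched
        elif isinstance(v, bytes):
            have_bytes = True
            if force_string:
                # Validation only: raises UnicodeDecodeError if the bytes
                # are not valid UTF-8. The decoded result is discarded.
                v.decode("utf-8")
            out.append(v)
        else:
            # Unicode strings need no extra check; encoding cannot
            # produce invalid UTF-8 for ordinary text.
            out.append(v.encode("utf-8"))
    if have_bytes and not force_string:
        return "binary", out
    return "string", out
```

With this model, `[b"a"]` yields `"binary"` by default and `"string"` when `force_string=True`, while a value such as `b"\xff"` raises when a string type is forced instead of producing a string array that holds invalid UTF-8.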
   




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438715#comment-16438715
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381263228
 
 
   I built for Python 2 and confirmed the behavior is the same. 
   
   @pitrou, in regards to the inefficiency of utf-8 encoding, it could be moved 
below to the check of global_have_bytes. Would you prefer this?
   
   ```cpp
   if (global_have_bytes) {
     if (force_string)
     {
       PyObject* obj;

       Ndarray1DIndexer<PyObject*> objects(arr_);
       Ndarray1DIndexer<uint8_t> mask_values;

       bool have_mask = false;
       if (mask_ != nullptr) {
         mask_values.Init(mask_);
         have_mask = true;
       }

       // Check that every non-null value can be viewed as UTF-8
       PyBytesView view;
       for (int64_t offset = 0; offset < objects.size(); ++offset) {
         OwnedRef tmp_obj;
         obj = objects[offset];
         if ((have_mask && mask_values[offset]) || internal::PandasObjectIsNull(obj)) {
           continue;
         }
         RETURN_NOT_OK(view.FromString(obj, true));
       }
     }
     else
     {
       // No string type was forced: relabel the chunks as binary
       for (size_t i = 0; i < out_arrays_.size(); ++i) {
         auto binary_data = out_arrays_[i]->data()->Copy();
         binary_data->type = ::arrow::binary();
         out_arrays_[i] = std::make_shared<BinaryArray>(binary_data);
       }
     }
   }
   ```
   
   I'm not fond of how much code I had to copy from AppendObjectStrings to 
write that loop. I think it would be helpful to have iterators that look like 
this:
   
   ```cpp
   NdArray1DIndexer array(array_);
   auto mask = NdArray1DIndexer::from_mask(mask_);
   NdArray1DMaskedIterator iterator(array.begin() + offset, array.end(), mask, 
true /* include masked value */);
   for (OwnedRef& obj: iterator)
   {
  // Maybe we use None to indicate masked values?
   }
   ```
   Or even better, we use pybind11 and these are light wrappers over them?
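
The masked-iterator idea can be illustrated with a small Python generator. This is only a sketch of the proposal: `iter_masked` and its parameters are hypothetical names, and it follows the suggestion above of using `None` to stand for masked values.

```python
def iter_masked(values, mask=None, offset=0, include_masked=True):
    """Yield values[offset:], honouring an optional boolean mask.

    A true mask entry marks the value as masked. Masked values are
    yielded as None when include_masked is true and skipped otherwise.
    """
    for i in range(offset, len(values)):
        if mask is not None and mask[i]:
            if include_masked:
                yield None
            continue
        yield values[i]
```

A caller would then write a single loop such as `for obj in iter_masked(arr, mask, offset):` instead of repeating the mask setup and null checks in every conversion routine.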




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438623#comment-16438623
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

xhochy commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381388117
 
 
   Not sure if there were more comments on it, but I just want to reiterate:
   
   >> Also, this doesn't change anything for Python 2 if using 'str' objects 
and the type is not specified, it will still create a BinaryArray, is this what 
we want?
   
   > Probably. Python 2 str objects are bytestrings just like Python 3 bytes 
objects.
   
   Yes, this is definitely the intended behaviour. We had some discussion in the 
past about it and stuck to the following when no type is specified:
   
   ```
   str(PY2) / bytes(PY3) –> pa.binary
   unicode(PY2) / str(PY3) –> pa.string
   ```
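
The default mapping above can be modelled in a few lines of Python. This is a sketch only: `infer_arrow_type` is a hypothetical helper, not a pyarrow function, and it returns the type name as a plain string. Because `bytes` is an alias of `str` on Python 2, the same `isinstance` check covers both rows of the mapping.

```python
def infer_arrow_type(values):
    """Return "binary" if any non-null value is a bytestring, else "string".

    Models the behaviour when no Arrow type is specified: a single
    bytes value is enough to make the whole array binary.
    """
    have_bytes = any(isinstance(v, bytes) for v in values if v is not None)
    return "binary" if have_bytes else "string"
```

For example, `infer_arrow_type([b"a"])` gives `"binary"` and `infer_arrow_type([u"a"])` gives `"string"`, matching the outputs quoted in the issue description.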




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437947#comment-16437947
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381264065
 
 
   > I think it would be helpful to have iterators that look like this:
   
   Probably, though that would be another PR :-)




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437944#comment-16437944
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381263863
 
 
   @joshuastorck The utf8 decoding check is in `BuilderAppend(StringBuilder*, 
...)`.




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437938#comment-16437938
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381263228
 
 
   I built for Python 2 and confirmed the behavior is the same. 
   
   @pitrou, regarding the inefficiency of the UTF-8 decoding, the check could be 
moved down into the `global_have_bytes` branch. Would you prefer this?
   
   ```cpp
   if (global_have_bytes) {
     if (force_string) {
       PyObject* obj;

       Ndarray1DIndexer<PyObject*> objects(arr_);
       Ndarray1DIndexer<uint8_t> mask_values;

       bool have_mask = false;
       if (mask_ != nullptr) {
         mask_values.Init(mask_);
         have_mask = true;
       }

       // Validate every non-null, unmasked element; bail out on the first failure.
       for (int64_t offset = 0; offset < objects.size(); ++offset) {
         OwnedRef tmp_obj;
         obj = objects[offset];
         if ((have_mask && mask_values[offset]) ||
             internal::PandasObjectIsNull(obj)) {
           continue;
         }
         OwnedRef(PyUnicode_AsUTF8String(obj));
         RETURN_IF_PYERROR();
       }
     } else {
       // No type was forced: relabel the outputs as binary.
       for (size_t i = 0; i < out_arrays_.size(); ++i) {
         auto binary_data = out_arrays_[i]->data()->Copy();
         binary_data->type = ::arrow::binary();
         out_arrays_[i] = std::make_shared<BinaryArray>(binary_data);
       }
     }
   }
   ```
   
   I'm not fond of how much code I had to copy from AppendObjectStrings to 
write that loop. I think it would be helpful to have iterators that look like 
this:
   
   ```cpp
   NdArray1DIndexer array(array_);
   auto mask = NdArray1DIndexer::from_mask(mask_);
   NdArray1DMaskedIterator iterator(array.begin() + offset, array.end(), mask, 
true /* include masked value */);
   for (OwnedRef& obj: iterator)
   {
  // Maybe we use None to indicate masked values?
   }
   ```
   Or even better, we use pybind11 and these are light wrappers over them?
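
To make the masked-iterator idea above concrete, here is a minimal sketch written in Python rather than C++. The name `iter_masked` and its `include_masked` flag are hypothetical and are not part of Arrow or NumPy; the sketch only illustrates the proposed behaviour of yielding `None` for masked positions.

```python
def iter_masked(values, mask=None, include_masked=True):
    """Yield each element of `values`, honouring an optional boolean mask.

    Masked positions are yielded as None when `include_masked` is true
    and skipped entirely otherwise. All names here are hypothetical.
    """
    for index, value in enumerate(values):
        masked = mask is not None and bool(mask[index])
        if masked:
            if include_masked:
                yield None
            continue
        yield value


values = [b"a", b"b", b"c"]
mask = [False, True, False]
with_placeholders = list(iter_masked(values, mask))
without_masked = list(iter_masked(values, mask, include_masked=False))
```

With a helper of this shape, the null and mask handling would live in one place instead of being copied into every loop over the array.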




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437757#comment-16437757
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on a change in pull request #1886: ARROW-2101: [Python/C++] 
Correctly convert numpy arrays of bytes to arrow arrays of strings when user 
specifies arrow type of string
URL: https://github.com/apache/arrow/pull/1886#discussion_r181482519
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -844,6 +846,13 @@ Status NumPyConverter::ConvertObjectStrings() {
   StringBuilder builder(pool_);
   RETURN_NOT_OK(builder.Resize(length_));
 
+  // If the creator of this NumPyConverter specified a type,
+  // then we want to force the output type to be utf8. If
+  // the input data is PyBytes and not PyUnicode and
+  // not convertible to utf8, the call to AppendObjectStrings
+  // below will fail because we pass force_string as the
+  // value for check_valid.
+  bool force_string = type_ != std::nullptr && type_->Equals(utf8());
 
 Review comment:
   Apparently some compilers don't like `std::nullptr`. Just use 
`type_ != nullptr`.




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437752#comment-16437752
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381231182
 
 
   By the way, the validity check is expensive since it utf8-decodes the 
bytestring.
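
As a rough Python analogue of why such a check costs something: validating by decoding builds a complete new string object that is then discarded. The helper `is_valid_utf8` below is a hypothetical illustration, not a pyarrow function, and it says nothing about how the C++ check is implemented.

```python
def is_valid_utf8(data):
    """Return True if `data` (a bytes object) decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # builds a throwaway str just to validate
    except UnicodeDecodeError:
        return False
    return True


ascii_ok = is_valid_utf8(b"x")
bad_bytes_ok = is_valid_utf8(b"\x80\x81")
```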




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437748#comment-16437748
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381230244
 
 
   > Also, this doesn't change anything for Python 2 if using 'str' objects and 
the type is not specified, it will still create a BinaryArray, is this what we 
want?
   
   *Probably*. Python 2 `str` objects are bytestrings just like Python 3 
`bytes` objects.




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437279#comment-16437279
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181381388
 
 

 ##
 File path: python/pyarrow/tests/test_convert_numpy.py
 ##
 @@ -0,0 +1,35 @@
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import numpy as np
+import pyarrow as pa
+
+import pytest
+
+# Regression test for ARROW-2101
+def test_convert_numpy_array_of_bytes_to_arrow_array_of_strings():
+    converted = pa.array(np.array([b'x'], dtype=object), pa.string())
+    assert converted.type == pa.string()
+
+# Make sure that if an ndarray of bytes is passed to the array
+# constructor and the type is string, it will fail if those bytes
+# cannot be converted to utf-8
+def test_convert_numpy_array_of_bytes_to_arrow_array_of_strings_bad_data():
+    with pytest.raises(pa.lib.ArrowException,
+                       message="Unknown error: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte"):
+        pa.array(np.array([b'\x80\x81'], dtype=object), pa.string())
 
 Review comment:
   Indeed. Also I don't think we need both Python and C++ tests. Given the 
difference in verbosity and maintainability, I'd favour writing the tests on 
the Python side.
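
For context on the message that the test in this diff expects: the part after "Unknown error:" appears to be the text of Python 3's own `UnicodeDecodeError`. The sketch below reproduces it with the standard library only; `utf8_decode_error_message` is a hypothetical helper and does not involve pyarrow.

```python
def utf8_decode_error_message(data):
    """Return the UnicodeDecodeError text for `data`, or None if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = utf8_decode_error_message(b"\x80\x81")
clean = utf8_decode_error_message(b"x")
```

Because the exact wording comes from the interpreter, a test that matches the full string may be sensitive to the Python version it runs under.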




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436552#comment-16436552
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

BryanCutler commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181258704
 
 

 ##
 File path: python/pyarrow/tests/test_convert_numpy.py
 ##
 @@ -0,0 +1,35 @@
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import numpy as np
+import pyarrow as pa
+
+import pytest
+
+# Regression test for ARROW-2101
+def test_convert_numpy_array_of_bytes_to_arrow_array_of_strings():
+    converted = pa.array(np.array([b'x'], dtype=object), pa.string())
+    assert converted.type == pa.string()
+
+# Make sure that if an ndarray of bytes is passed to the array
+# constructor and the type is string, it will fail if those bytes
+# cannot be converted to utf-8
+def test_convert_numpy_array_of_bytes_to_arrow_array_of_strings_bad_data():
+    with pytest.raises(pa.lib.ArrowException,
+                       message="Unknown error: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte"):
+        pa.array(np.array([b'\x80\x81'], dtype=object), pa.string())
 
 Review comment:
   I think these tests would be fine in 'test_convert_pandas'




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436545#comment-16436545
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

BryanCutler commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181258433
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1393,7 +1412,13 @@ inline Status 
NumPyConverter::ConvertTypedLists(
   RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT));
 
   int64_t offset = 0;
-  RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, 
value_builder, ,
+  // If a type was specified and it was utf8, then we set
+  // check_valid to true. If any of the input cannot be
+  // converted, then we will exit early here.
+  auto check_valid = type_ != 0 && type_->Equals(::arrow::utf8());
 
 Review comment:
   no need to use `auto` here




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436517#comment-16436517
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

BryanCutler commented on issue #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#issuecomment-380981795
 
 
   Just so I have this straight: the old behavior was that when the user 
specified an explicit type of `pa.string()` and a binary object was found, the 
conversion would fall back to `BinaryArray` and continue.  This change instead 
tries to convert the object to UTF-8 and raises an error if that fails, but 
only if the type is specified?
   
   Does anyone know if there was a reason to fall back in this case?  I think 
this change makes sense, but I just want to make sure we are not breaking 
anything.
   
   Also, this doesn't change anything for Python 2 if using 'str' objects and 
the type is not specified, it will still create a `BinaryArray`, is this what 
we want?
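
The rules described in this comment can be summarised with a toy model. This is a sketch under stated assumptions, not pyarrow code: `infer_arrow_type` and the plain strings "string" and "binary" are hypothetical stand-ins for the converter and for `pa.string()` / `pa.binary()`.

```python
def infer_arrow_type(values, requested_type=None):
    """Toy model of the conversion rules discussed above (not pyarrow code).

    Returns "string" or "binary" for a sequence of str/bytes/None values.
    """
    if requested_type == "string":
        # New behaviour: a requested string type forces UTF-8 validation.
        for value in values:
            if isinstance(value, bytes):
                value.decode("utf-8")  # raises UnicodeDecodeError on bad input
        return "string"
    # No type requested: any bytes object makes the result binary.
    have_bytes = any(isinstance(value, bytes) for value in values)
    return "binary" if have_bytes else "string"


forced = infer_arrow_type([b"x"], "string")
inferred_bytes = infer_arrow_type([b"x"])
inferred_text = infer_arrow_type([u"x"])
```

Under this model the only case that changes is the one where a string type is requested: valid bytes now produce a string result, and invalid bytes raise instead of silently producing binary data.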




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436491#comment-16436491
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

BryanCutler commented on issue #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#issuecomment-380977087
 
 
   Thanks for the PR @joshuastorck !  Could you please update the title to 
start with "ARROW-2101: [Python] ..." and make it and the description a little 
more informative rather than just referencing the JIRA?  Also, in the future, 
could you assign or make a comment in the JIRA that you are working on it to 
let people know and prevent duplicate efforts?




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436389#comment-16436389
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

cpcloud commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181238447
 
 

 ##
 File path: cpp/src/arrow/python/python-test.cc
 ##
 @@ -383,5 +384,57 @@ TEST(PythonTest, ConstructStringArrayWithLeadingZeros) {
   ASSERT_OK(ConvertPySequence(list, pool, ));
 }
 
+class NdarrayToArrowTest: public ::testing::Test {
+
+public:
+
+  void CreateNdarrayWithOneString(const char* value, OwnedRef& ref)
+  {
+    npy_intp dims[1];
+    dims[0] = 1;
+    auto array_object = PyArray_SimpleNew(1, dims, NPY_OBJECT);
+    auto array = reinterpret_cast<PyArrayObject*>(array_object);
+    ASSERT_TRUE(array != 0);
+    dims[0] = 0;
+    auto dest = PyArray_GetPtr(array, dims);
+    auto bytes_object = PyBytes_FromString(value);
+    ASSERT_NE(-1, PyArray_SETITEM(array, reinterpret_cast<char*>(dest), bytes_object));
 
 Review comment:
   Does this steal a reference?




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436390#comment-16436390
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

cpcloud commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181238494
 
 

 ##
 File path: cpp/src/arrow/python/python-test.cc
 ##
 @@ -383,5 +384,57 @@ TEST(PythonTest, ConstructStringArrayWithLeadingZeros) {
   ASSERT_OK(ConvertPySequence(list, pool, ));
 }
 
+class NdarrayToArrowTest: public ::testing::Test {
+
+public:
+
+  void CreateNdarrayWithOneString(const char* value, OwnedRef& ref)
+  {
+    npy_intp dims[1];
+    dims[0] = 1;
+    auto array_object = PyArray_SimpleNew(1, dims, NPY_OBJECT);
+    auto array = reinterpret_cast<PyArrayObject*>(array_object);
+    ASSERT_TRUE(array != 0);
+    dims[0] = 0;
+    auto dest = PyArray_GetPtr(array, dims);
+    auto bytes_object = PyBytes_FromString(value);
+    ASSERT_NE(-1, PyArray_SETITEM(array, reinterpret_cast<char*>(dest), bytes_object));
+
+    Py_XDECREF(bytes_object);
+
+    ref.reset(array_object);
+  }
+
+};
+
+// Regression for ARROW-2101
+TEST_F(NdarrayToArrowTest, BytesToStringWhenTypeSpecified)
+{
+  OwnedRef array;
+  this->CreateNdarrayWithOneString("x", array);
+
+  auto arrow_type = ::arrow::utf8();
+  std::shared_ptr arrow_array;
+  ASSERT_OK(NdarrayToArrow(default_memory_pool(), 
reinterpret_cast(array.obj()), 0,
+  true, arrow_type, _array));
 
 Review comment:
   `make format`




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436382#comment-16436382
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

cpcloud commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181237392
 
 

 ##
 File path: cpp/src/arrow/python/python-test.cc
 ##
 @@ -383,5 +384,57 @@ TEST(PythonTest, ConstructStringArrayWithLeadingZeros) {
   ASSERT_OK(ConvertPySequence(list, pool, ));
 }
 
+class NdarrayToArrowTest: public ::testing::Test {
+
+public:
+
+  void CreateNdarrayWithOneString(const char* value, OwnedRef& ref)
+  {
+    npy_intp dims[1];
+    dims[0] = 1;
+    auto array_object = PyArray_SimpleNew(1, dims, NPY_OBJECT);
+    auto array = reinterpret_cast<PyArrayObject*>(array_object);
+    ASSERT_TRUE(array != 0);
+    dims[0] = 0;
+    auto dest = PyArray_GetPtr(array, dims);
+    auto bytes_object = PyBytes_FromString(value);
 
 Review comment:
   Should we assert that this isn't `nullptr`?




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436379#comment-16436379
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

cpcloud commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181237063
 
 

 ##
 File path: cpp/src/arrow/python/python-test.cc
 ##
 @@ -383,5 +384,57 @@ TEST(PythonTest, ConstructStringArrayWithLeadingZeros) {
   ASSERT_OK(ConvertPySequence(list, pool, ));
 }
 
+class NdarrayToArrowTest: public ::testing::Test {
 
 Review comment:
   You'll probably get a format check error because of missing a space before 
the colon after `ArrowTest`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org







[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436376#comment-16436376
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

cpcloud commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181236836
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1393,7 +1412,13 @@ inline Status NumPyConverter::ConvertTypedLists(
   RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT));
 
   int64_t offset = 0;
-  RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, value_builder, ,
+  // If a type was specified and it was utf8, then we set
+  // check_valid to true. If any of the input cannot be
+  // converted, then we will exit early here.
+  auto check_valid = type_ != 0 && type_->Equals(::arrow::utf8());
+  RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0,
+   check_valid,
+   value_builder, ,
 
 Review comment:
   `make format` for this as well.
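
The condition introduced in this hunk can be paraphrased in Python as follows. This is only a sketch of the logic in the diff: `should_check_valid` and the `UTF8` constant are hypothetical names, and plain strings stand in for Arrow data types.

```python
UTF8 = "utf8"  # stand-in for the Arrow UTF-8 string type


def should_check_valid(requested_type):
    """Return True only when the caller explicitly requested the UTF-8 type.

    `None` plays the role of "no type specified" (a null pointer in C++).
    """
    return requested_type is not None and requested_type == UTF8


for requested in (None, "binary", UTF8):
    print(requested, should_check_valid(requested))
```

Validation is therefore opt-in: input with no requested type, or with a non-string type, takes the same path as before.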









[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436375#comment-16436375
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

cpcloud commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181236801
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1393,7 +1412,13 @@ inline Status NumPyConverter::ConvertTypedLists(
   RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT));
 
   int64_t offset = 0;
-  RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, value_builder, ,
+  // If a type was specified and it was utf8, then we set
+  // check_valid to true. If any of the input cannot be
+  // converted, then we will exit early here.
+  auto check_valid = type_ != 0 && type_->Equals(::arrow::utf8());
 
 Review comment:
   Use `nullptr` here as well (`type_ != nullptr` rather than `type_ != 0`).









[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436368#comment-16436368
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

cpcloud commented on a change in pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886#discussion_r181236498
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -228,11 +228,15 @@ static Status AppendObjectBinaries(PyArrayObject* arr, PyArrayObject* mask,
 /// can fit
 ///
 /// \param[in] offset starting offset for appending
+/// \param[in] check_valid if set to true and the input array
+/// contains values that cannot be converted to unicode, returns
+/// a Status code containing a Python exception message
 /// \param[out] end_offset ending offset where we stopped appending. Will
 /// be length of arr if fully consumed
 /// \param[out] have_bytes true if we encountered any PyBytes object
 static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64_t offset,
-  StringBuilder* builder, int64_t* end_offset,
+  bool check_valid, StringBuilder* builder,
+ int64_t* end_offset,
 
 Review comment:
   `make format` should take care of this.
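
The `check_valid` parameter documented in this hunk can be paraphrased in Python 3 as below. This is a rough sketch of the described behaviour, not the C++ implementation: `append_object_strings` is a hypothetical stand-in, a plain list plays the role of the `StringBuilder`, and keeping undecodable bytes as-is when `check_valid` is false is an assumption.

```python
def append_object_strings(values, check_valid, builder):
    """Append text/bytes values to `builder` (a list standing in for a builder).

    Returns True if any bytes object was seen (the `have_bytes` flag).
    If check_valid is true, bytes that are not valid UTF-8 raise an error.
    """
    have_bytes = False
    for value in values:
        if value is None:
            builder.append(None)  # treat None as a null entry
        elif isinstance(value, bytes):
            have_bytes = True
            if check_valid:
                # Raises UnicodeDecodeError on invalid UTF-8; this plays the
                # role of exiting early with an error status.
                value.decode("utf-8")
            builder.append(value)
        elif isinstance(value, str):
            builder.append(value.encode("utf-8"))
        else:
            raise TypeError("expected bytes or text, got %r" % type(value))
    return have_bytes


builder = []
saw_bytes = append_object_strings(["a", b"b", None], True, builder)
print(saw_bytes, builder)
```

With `check_valid` set, a value such as `b"\xff"` raises instead of being appended silently.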









[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436215#comment-16436215
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

joshuastorck opened a new pull request #1886: Bug fix for ARROW-2101
URL: https://github.com/apache/arrow/pull/1886
 
 
   See https://issues.apache.org/jira/browse/ARROW-2101.









[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-05 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427646#comment-16427646
 ] 

Antoine Pitrou commented on ARROW-2101:
---

Specifying {{pa.string()}} should definitely produce a string array, IMHO.

Note the following discrepancy (Python 3):
{code:python}
>>> pa.array([b'x'], type=pa.string())

[
  'x'
]
>>> pa.array(np.array([b'x'], dtype=object), type=pa.string())

[
  b'x'
]
{code}






[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-05 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427643#comment-16427643
 ] 

Bryan Cutler commented on ARROW-2101:
-

I'm not exactly sure what the right behavior is here.  If you are staying 
entirely within a Python 2 environment, then it makes sense that {{str}} would 
correspond to a binary type column, and it doesn't cause a problem going 
to/from Pandas.  In PySpark, we send Arrow data to Java, and it seems like 
specifying {{pa.string()}} should lead to a string column in Java (a 
VarCharVector).  Currently, we do a check and call {{decode('utf-8')}} before 
calling {{from_pandas}}, which works fine, but I thought that pyarrow might 
perform the conversion to UTF-8 automatically if the user specifies the type. 
However, it's also possible that casting to UTF-8 wouldn't be wanted, e.g. if 
you are only in a Python 2 environment and don't mind treating it as a binary 
column.

To sum this issue up: is {{pa.string()}} meant to always correspond to an 
Arrow UTF-8 type, or is it special in this case, depending on whether your 
{{str}} came from Python 2 or 3?  cc [~wesmckinn] [~xhochy] [~pitrou] for 
thoughts
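
The workaround described above (decoding before calling {{from_pandas}}) can be sketched roughly as follows. This is a minimal illustration in plain Python, not PySpark's actual code: the helper name {{decode_bytes_values}} is hypothetical, and the pyarrow call appears only in a comment.

```python
def decode_bytes_values(values, encoding="utf-8"):
    """Return a new list in which every bytes value is decoded to text.

    Hypothetical helper for illustration; None (missing) values and values
    that are already text pass through unchanged.
    """
    decoded = []
    for value in values:
        if isinstance(value, bytes):
            # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
            value = value.decode(encoding)
        decoded.append(value)
    return decoded


raw = [b"a", u"b", None]
clean = decode_bytes_values(raw)
print(clean)
# The cleaned values could then be handed to pyarrow, for example:
#   pa.Array.from_pandas(pd.Series(clean), type=pa.string())
```

On Python 2, {{bytes}} is an alias for {{str}}, so the same check covers the case reported in this issue.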






[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-05 Thread Krisztian Szucs (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427462#comment-16427462
 ] 

Krisztian Szucs commented on ARROW-2101:


I think the relevant code is here: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/numpy_to_arrow.cc#L839
The comments there nicely explain what happens. 

The example with an explicit string datatype presumes the opposite direction.
Of course, a high-level {{decode('utf-8')}} would work. What is the preferred 
way to do this kind of conversion?



