[jira] [Commented] (ARROW-1691) [Java] Conform Java Decimal type implementation to format decisions in ARROW-1588

2017-11-13 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250248#comment-16250248
 ] 

Phillip Cloud commented on ARROW-1691:
--

Yep, resolved in https://github.com/apache/arrow/pull/1267.

> [Java] Conform Java Decimal type implementation to format decisions in 
> ARROW-1588
> -
>
> Key: ARROW-1691
> URL: https://issues.apache.org/jira/browse/ARROW-1691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ARROW-1691) [Java] Conform Java Decimal type implementation to format decisions in ARROW-1588

2017-11-13 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-1691.
--
Resolution: Fixed

> [Java] Conform Java Decimal type implementation to format decisions in 
> ARROW-1588
> -
>
> Key: ARROW-1691
> URL: https://issues.apache.org/jira/browse/ARROW-1691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>






[jira] [Created] (ARROW-1811) Rename all Decimal based APIs to Decima128

2017-11-14 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-1811:


 Summary: Rename all Decimal based APIs to Decima128
 Key: ARROW-1811
 URL: https://issues.apache.org/jira/browse/ARROW-1811
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.7.1
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.8.0








[jira] [Updated] (ARROW-1811) [C++/Python] Rename all Decimal based APIs to Decima128

2017-11-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-1811:
-
Summary: [C++/Python] Rename all Decimal based APIs to Decima128  (was: 
Rename all Decimal based APIs to Decima128)

> [C++/Python] Rename all Decimal based APIs to Decima128
> ---
>
> Key: ARROW-1811
> URL: https://issues.apache.org/jira/browse/ARROW-1811
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>






[jira] [Updated] (ARROW-1811) [C++/Python] Rename all Decimal based APIs to Decimal128

2017-11-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-1811:
-
Summary: [C++/Python] Rename all Decimal based APIs to Decimal128  (was: 
[C++/Python] Rename all Decimal based APIs to Decima128)

> [C++/Python] Rename all Decimal based APIs to Decimal128
> 
>
> Key: ARROW-1811
> URL: https://issues.apache.org/jira/browse/ARROW-1811
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>






[jira] [Closed] (ARROW-1814) [C++] Determine whether we need to implement BYTE_ARRAY-backed Decimal reads

2017-11-15 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud closed ARROW-1814.

Resolution: Invalid

> [C++] Determine whether we need to implement BYTE_ARRAY-backed Decimal reads
> 
>
> Key: ARROW-1814
> URL: https://issues.apache.org/jira/browse/ARROW-1814
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>
> These are valid in the parquet spec, but it seems like no system in use today 
> implements a writer for this type.
> We should determine whether this is YAGNI, or if it's actually in use 
> anywhere.





[jira] [Commented] (ARROW-1814) [C++] Determine whether we need to implement BYTE_ARRAY-backed Decimal reads

2017-11-15 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253752#comment-16253752
 ] 

Phillip Cloud commented on ARROW-1814:
--

Whoops, I meant this for parquet-cpp.

> [C++] Determine whether we need to implement BYTE_ARRAY-backed Decimal reads
> 
>
> Key: ARROW-1814
> URL: https://issues.apache.org/jira/browse/ARROW-1814
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>
> These are valid in the parquet spec, but it seems like no system in use today 
> implements a writer for this type.
> We should determine whether this is YAGNI, or if it's actually in use 
> anywhere.





[jira] [Created] (ARROW-1814) [C++] Determine whether we need to implement BYTE_ARRAY-backed Decimal reads

2017-11-15 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-1814:


 Summary: [C++] Determine whether we need to implement 
BYTE_ARRAY-backed Decimal reads
 Key: ARROW-1814
 URL: https://issues.apache.org/jira/browse/ARROW-1814
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Affects Versions: 0.7.1
Reporter: Phillip Cloud
Assignee: Phillip Cloud


These are valid in the parquet spec, but it seems like no system in use today 
implements a writer for this type.

We should determine whether this is YAGNI, or if it's actually in use anywhere.





[jira] [Created] (ARROW-1839) [C++/Python] Add Decimal Parquet Read/Write Tests

2017-11-19 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-1839:


 Summary: [C++/Python] Add Decimal Parquet Read/Write Tests
 Key: ARROW-1839
 URL: https://issues.apache.org/jira/browse/ARROW-1839
 Project: Apache Arrow
  Issue Type: Test
  Components: C++, Python
Affects Versions: 0.7.1
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.8.0








[jira] [Commented] (ARROW-1863) Should use PyObject_Str or PyObject_Repr in PyObjectStringify

2017-11-27 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267238#comment-16267238
 ] 

Phillip Cloud commented on ARROW-1863:
--

[~advancedxy] Thanks for the report!

This definitely shouldn't segfault, but {{PyObjectStringify}} is meant to 
convert a Python {{str}}, {{bytes}}, or {{unicode}} object to {{const char*}}; 
it's not meant to take an arbitrary Python object and convert it to a string.

I think this should raise an error, since you're telling Arrow to construct an 
array of type string while passing it a non-string object.

It seems arbitrary to enable this behavior for type {{X}} to {{string}}, but 
not for, say, {{string}} to {{int64}}. Why should implicit conversion from type 
{{X}} to {{string}} be special?

For example, should this try to convert the string to an integer?

{code}
data = [1, 2, '3']
pyarrow.array(data, type=pyarrow.int64())
{code}

I don't think so.

Implicit casting from one type to another is a slippery slope: it makes the 
output of a function hard to predict, especially since Python lets objects 
override their own string representation.
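To make the contract concrete, here is a pure-Python sketch (not the actual C++ implementation, and the function name is hypothetical) of the str/bytes dispatch {{PyObjectStringify}} performs, with the strict error this comment argues for in the fallback branch:

```python
def stringify_strict(obj):
    """Return UTF-8 bytes for str/bytes inputs; reject everything else.

    Mirrors PyObjectStringify's two string branches, but raises instead
    of leaving a null pointer when handed an arbitrary object.
    """
    if isinstance(obj, str):
        # corresponds to the PyUnicode_AsUTF8String branch
        return obj.encode("utf-8")
    if isinstance(obj, bytes):
        # corresponds to the PyBytes_AsString branch
        return obj
    raise TypeError(f"expected str or bytes, got {type(obj).__name__}")
```

Under this contract, {{pyarrow.array(['a', {'b': 1}], type=pyarrow.string())}} would fail with a clear error rather than crash or silently stringify the dict.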

> Should use PyObject_Str or PyObject_Repr in PyObjectStringify
> -
>
> Key: ARROW-1863
> URL: https://issues.apache.org/jira/browse/ARROW-1863
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Xianjin YE
> Fix For: 0.8.0
>
>
> PyObjectStringify doesn't handle non-string (bytes or utf-8) types correctly. 
> It should use PyObject_Repr (or PyObject_Str) to get the string representation 
> of the PyObject.
> {code:java}
> struct ARROW_EXPORT PyObjectStringify {
>   OwnedRef tmp_obj;
>   const char* bytes;
>   Py_ssize_t size;
>   explicit PyObjectStringify(PyObject* obj) {
> PyObject* bytes_obj;
> if (PyUnicode_Check(obj)) {
>   bytes_obj = PyUnicode_AsUTF8String(obj);
>   tmp_obj.reset(bytes_obj);
>   bytes = PyBytes_AsString(bytes_obj);
>   size = PyBytes_GET_SIZE(bytes_obj);
> } else if (PyBytes_Check(obj)) {
>   bytes = PyBytes_AsString(obj);
>   size = PyBytes_GET_SIZE(obj);
> } else {
>   bytes = NULLPTR;
>   size = -1;
> }
>   }
> };
> {code}
> should change to 
> {code:java}
> struct ARROW_EXPORT PyObjectStringify {
>   OwnedRef tmp_obj;
>   const char* bytes;
>   Py_ssize_t size;
>   explicit PyObjectStringify(PyObject* obj) {
> PyObject* bytes_obj;
> if (PyUnicode_Check(obj)) {
>   bytes_obj = PyUnicode_AsUTF8String(obj);
>   tmp_obj.reset(bytes_obj);
>   bytes = PyBytes_AsString(bytes_obj);
>   size = PyBytes_GET_SIZE(bytes_obj);
> } else if (PyBytes_Check(obj)) {
>   bytes = PyBytes_AsString(obj);
>   size = PyBytes_GET_SIZE(obj);
> } else {
>   bytes_obj = PyObject_Repr(obj);
>   tmp_obj.reset(bytes_obj);
>   bytes = PyBytes_AsString(bytes_obj);
>   size = PyBytes_GET_SIZE(bytes_obj);
> }
>   }
> };
> {code}
> How does this affect pyarrow? Minimal reproduction case:
> {code:java}
> import pyarrow
> data = ['-10', '-5', {'a': 1}, '0', '5', '10']
> arr = pyarrow.array(data, type=pyarrow.string())
> [1]64491 segmentation fault  ipython
> {code}
> This case was found by my colleague. I will ask him to send a PR here.  
> cc [~wesmckinn]





[jira] [Assigned] (ARROW-1863) Should use PyObject_Str or PyObject_Repr in PyObjectStringify

2017-11-27 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1863:


Assignee: Phillip Cloud

> Should use PyObject_Str or PyObject_Repr in PyObjectStringify
> -
>
> Key: ARROW-1863
> URL: https://issues.apache.org/jira/browse/ARROW-1863
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Xianjin YE
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>





[jira] [Commented] (ARROW-1863) [Python] PyObjectStringify could render bytes-like output for more types of objects

2017-11-28 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269463#comment-16269463
 ] 

Phillip Cloud commented on ARROW-1863:
--

[~advancedxy] That sounds good to me. If we're just using it for error messages, 
that's great. Do you (or your colleague) want to put up a PR to fix this?

> [Python] PyObjectStringify could render bytes-like output for more types of 
> objects
> ---
>
> Key: ARROW-1863
> URL: https://issues.apache.org/jira/browse/ARROW-1863
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Xianjin YE
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>





[jira] [Created] (ARROW-1871) [Python/C++] Appending Python Decimals with different scales requires rescaling

2017-11-29 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-1871:


 Summary: [Python/C++] Appending Python Decimals with different 
scales requires rescaling
 Key: ARROW-1871
 URL: https://issues.apache.org/jira/browse/ARROW-1871
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.7.1
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.8.0








[jira] [Created] (ARROW-1879) [Python] Dask integration tests are not skipped if dask is not installed

2017-12-03 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-1879:


 Summary: [Python] Dask integration tests are not skipped if dask 
is not installed
 Key: ARROW-1879
 URL: https://issues.apache.org/jira/browse/ARROW-1879
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.1
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.8.0








[jira] [Commented] (ARROW-1863) [Python] PyObjectStringify could render bytes-like output for more types of objects

2017-12-03 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276037#comment-16276037
 ] 

Phillip Cloud commented on ARROW-1863:
--

Looks like this is already fixed in master. I'll add this example to our test 
suite to prevent regressions.

> [Python] PyObjectStringify could render bytes-like output for more types of 
> objects
> ---
>
> Key: ARROW-1863
> URL: https://issues.apache.org/jira/browse/ARROW-1863
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Xianjin YE
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>





[jira] [Commented] (ARROW-1863) [Python] PyObjectStringify could render bytes-like output for more types of objects

2017-12-03 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276065#comment-16276065
 ] 

Phillip Cloud commented on ARROW-1863:
--

Interesting, this is only showing up on OS X.

> [Python] PyObjectStringify could render bytes-like output for more types of 
> objects
> ---
>
> Key: ARROW-1863
> URL: https://issues.apache.org/jira/browse/ARROW-1863
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Xianjin YE
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>





[jira] [Assigned] (ARROW-1883) [Python] BUG: Table.to_pandas metadata checking fails if columns are not present

2017-12-04 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1883:


Assignee: Phillip Cloud

> [Python] BUG: Table.to_pandas metadata checking fails if columns are not 
> present
> 
>
> Key: ARROW-1883
> URL: https://issues.apache.org/jira/browse/ARROW-1883
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Joris Van den Bossche
>Assignee: Phillip Cloud
>  Labels: pull-request-available
>
> Found this bug in the example in the pandas documentation 
> (http://pandas-docs.github.io/pandas-docs-travis/io.html#parquet), which does:
> {code}
> df = pd.DataFrame({'a': list('abc'),
>'b': list(range(1, 4)),
>'c': np.arange(3, 6).astype('u1'),
>'d': np.arange(4.0, 7.0, dtype='float64'),
>'e': [True, False, True],
>'f': pd.date_range('20130101', periods=3),
>'g': pd.date_range('20130101', periods=3, 
> tz='US/Eastern')})
> df.to_parquet('example_pa.parquet', engine='pyarrow')
> pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
> {code}
> and this raises in the last line reading a subset of columns:
> {code}
> ...
> /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
>  in _add_any_metadata(table, pandas_metadata)
> 357 for i, col_meta in enumerate(pandas_metadata['columns']):
> 358 if col_meta['pandas_type'] == 'datetimetz':
> --> 359 col = table[i]
> 360 converted = col.to_pandas()
> 361 tz = col_meta['metadata']['timezone']
> table.pxi in pyarrow.lib.Table.__getitem__()
> table.pxi in pyarrow.lib.Table.column()
> IndexError: Table column index 6 is out of range
> {code}
> This is due to checking the `pandas_metadata` for all columns (and in this 
> case trying to deal with a datetime tz column), while in practice not all 
> columns are present in this case ('mismatch' between pandas metadata and 
> actual schema). 
> A smaller example without parquet:
> {code}
> In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01", 
> periods=3, tz='Europe/Brussels')})
> In [39]: table = pyarrow.Table.from_pandas(df)
> In [40]: table
> Out[40]: 
> pyarrow.Table
> a: int64
> b: timestamp[ns, tz=Europe/Brussels]
> __index_level_0__: int64
> metadata
> 
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, 
> "numpy_t'
> b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", 
> "meta'
> b'data": {"timezone": "Europe/Brussels"}, "numpy_type": 
> "datetime6'
> b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", 
> '
> b'"metadata": null, "numpy_type": "int64", "name": 
> "__index_level_'
> b'0__"}], "index_columns": ["__index_level_0__"], 
> "pandas_version"'
> b': "0.22.0.dev0+277.gd61f411"}'}
> In [41]: table.to_pandas()
> Out[41]: 
>a b
> 0  1 2017-01-01 00:00:00+01:00
> 1  2 2017-01-02 00:00:00+01:00
> 2  3 2017-01-03 00:00:00+01:00
> In [44]: table_without_tz = table.remove_column(1)
> In [45]: table_without_tz
> Out[45]: 
> pyarrow.Table
> a: int64
> __index_level_0__: int64
> metadata
> 
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, 
> "numpy_t'
> b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", 
> "meta'
> b'data": {"timezone": "Europe/Brussels"}, "numpy_type": 
> "datetime6'
> b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", 
> '
> b'"metadata": null, "numpy_type": "int64", "name": 
> "__index_level_'
> b'0__"}], "index_columns": ["__index_level_0__"], 
> "pandas_version"'
> b': "0.22.0.dev0+277.gd61f411"}'}
> In [46]: table_without_tz.to_pandas()  # <-- wrong output !
> Out[46]: 
>  a
> 1970-01-01 01:00:00+01:001
> 1970-01-01 01:00:00.1+01:00  2
> 1970-01-01 01:00:00.2+01:00  3
> In [47]: table_without_tz2 = table_without_tz.remove_column(1)
> In [48]: table_without_tz2
> Out[48]: 
> pyarrow.Table
> a: int64
> metadata
> 
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, 
> "numpy_t'
> b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", 
> "meta'
> b'data": {"timezone": "Europe/Brussels"}, "numpy_type": 
> "datetime6'
> b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", 
> '
> b'"metadata": null, "numpy_type": "int64", "name": 
> "__index_level_'
> 

[jira] [Created] (ARROW-1895) Add field_name to pandas index metadata

2017-12-06 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-1895:


 Summary: Add field_name to pandas index metadata
 Key: ARROW-1895
 URL: https://issues.apache.org/jira/browse/ARROW-1895
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.7.1
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.8.0


See the discussion here for details:

https://github.com/pandas-dev/pandas/pull/18201

In short, we need a way to map index column names to field names in an Arrow 
Table.

Additionally, we currently depend on the index columns being written at the end 
of the table; fixing this would allow us to read metadata written by other 
systems (e.g., fastparquet) that don't make that assumption.
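A minimal sketch of what resolving index columns by an explicit field_name could look like. The metadata layout below is illustrative (modeled on the existing 'pandas' schema metadata, with a hypothetical 'field_name' key added), not the final design:

```python
def resolve_index_columns(pandas_metadata):
    """Look up each index column's metadata by field_name, instead of
    assuming index columns sit at the end of the table."""
    by_field = {c["field_name"]: c for c in pandas_metadata["columns"]}
    return [by_field[name] for name in pandas_metadata["index_columns"]]


# Illustrative metadata: the index column is listed first here, which a
# positional scheme would mishandle but a field_name lookup resolves fine.
meta = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": None, "field_name": "__index_level_0__",
         "pandas_type": "int64"},
        {"name": "a", "field_name": "a", "pandas_type": "int64"},
    ],
}
```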





[jira] [Updated] (ARROW-1895) Add field_name to pandas index metadata

2017-12-06 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-1895:
-
Component/s: Python

> Add field_name to pandas index metadata
> ---
>
> Key: ARROW-1895
> URL: https://issues.apache.org/jira/browse/ARROW-1895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>
> See the discussion here for details:
> https://github.com/pandas-dev/pandas/pull/18201
> In short we need a way to map index column names to field names in an arrow 
> Table.
> Additionally, we're depending on the index columns being written at the end 
> of the table and fixing this would allow us to read metadata written by other 
> systems (e.g., fastparquet) that don't make this assumption.





[jira] [Updated] (ARROW-1895) [Python] Add field_name to pandas index metadata

2017-12-06 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-1895:
-
Summary: [Python] Add field_name to pandas index metadata  (was: Add 
field_name to pandas index metadata)

> [Python] Add field_name to pandas index metadata
> 
>
> Key: ARROW-1895
> URL: https://issues.apache.org/jira/browse/ARROW-1895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>
> See the discussion here for details:
> https://github.com/pandas-dev/pandas/pull/18201
> In short we need a way to map index column names to field names in an arrow 
> Table.
> Additionally, we're depending on the index columns being written at the end 
> of the table and fixing this would allow us to read metadata written by other 
> systems (e.g., fastparquet) that don't make this assumption.





[jira] [Assigned] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1897:


Assignee: Phillip Cloud

> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>Assignee: Phillip Cloud
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.9.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>...:   index=pd.CategoricalIndex(['one', 'two'], 
> name='idx'))
>...:
> In [8]: sink = io.BytesIO()
>...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>...: 
> json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> bq. The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.
> So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
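Per the quoted spec, the codes dtype scales with the number of categories. The helper below is a hypothetical stdlib-only sketch of that selection (in practice pandas picks the dtype of `Categorical.codes` itself, and pyarrow would report `str(codes.dtype)`):

```python
def codes_numpy_type(num_categories):
    """Smallest signed integer dtype name that can hold category codes.

    Codes range over [-1, num_categories), where -1 marks a missing
    value, so intN suffices whenever num_categories - 1 <= 2**(N-1) - 1.
    """
    for bits in (8, 16, 32, 64):
        if num_categories <= 2 ** (bits - 1):
            return f"int{bits}"
    raise ValueError("too many categories for a 64-bit code type")
```

For the two-category index in the reproduction above, this yields 'int8', matching the value the reporter expects in place of 'object'.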





[jira] [Commented] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281861#comment-16281861
 ] 

Phillip Cloud commented on ARROW-1897:
--

I think we can get this in for 0.8.0. I want to avoid another backwards-compat 
issue, so it's best to take care of as many of these as we can.

> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>Assignee: Phillip Cloud
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.9.0
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1908) [Python] Construction of arrow table from pandas DataFrame allows duplicate column names

2017-12-09 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-1908:


 Summary: [Python] Construction of arrow table from pandas 
DataFrame allows duplicate column names
 Key: ARROW-1908
 URL: https://issues.apache.org/jira/browse/ARROW-1908
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.1
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.8.0


[~jorisvandenbossche]'s example here: 
https://github.com/pandas-dev/pandas/pull/18201#issuecomment-350259248 shows 
that a {{pyarrow.Table}} with duplicate column names can be constructed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1908) [Python] Construction of arrow table from pandas DataFrame allows duplicate column names

2017-12-09 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284810#comment-16284810
 ] 

Phillip Cloud commented on ARROW-1908:
--

In [~jorisvandenbossche]'s example the construction succeeds, but in the 
following example the interpreter crashes:

{code}
In [1]: df = pd.DataFrame([(1, 'a'), (2, 'b')], columns=list('aa'))

In [2]: t = pa.Table.from_pandas(df)
{code}
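The crash above could be avoided with an up-front check for duplicate names before handing the frame to Arrow. A minimal, dependency-free sketch (the helper name {{find_duplicate_columns}} is hypothetical, not a pyarrow API):

```python
from collections import Counter

def find_duplicate_columns(columns):
    """Return the column names that appear more than once (hypothetical helper)."""
    return sorted(name for name, n in Counter(columns).items() if n > 1)

# Mirrors the failing example: columns=list('aa') -> ['a', 'a']
dupes = find_duplicate_columns(list('aa'))
print(dupes)  # ['a']
```

A caller could raise a clear {{ValueError}} when this list is non-empty instead of letting the interpreter crash.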

> [Python] Construction of arrow table from pandas DataFrame allows duplicate 
> column names
> 
>
> Key: ARROW-1908
> URL: https://issues.apache.org/jira/browse/ARROW-1908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>
> [~jorisvandenbossche]'s example here: 
> https://github.com/pandas-dev/pandas/pull/18201#issuecomment-350259248 shows 
> that a {{pyarrow.Table}} with duplicate column names can be constructed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1908) [Python] Construction of arrow table from pandas DataFrame allows duplicate column names

2017-12-09 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284962#comment-16284962
 ] 

Phillip Cloud commented on ARROW-1908:
--

I think we shouldn't allow duplicate field names. There are so many problems 
that arise when columns can have the same name. One that shows up in the 
arrow-cpp implementation is that calls to {{GetFieldByName}} become 
unpredictable.
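A toy sketch of why name-based lookup breaks down (plain Python, not the arrow-cpp implementation): a name-to-index map built over duplicate names can only keep one entry per name, so one of the fields becomes unreachable by name.

```python
# Toy model of schema field lookup with duplicate names. The second field
# named 'a' silently overwrites the first in the name index, so a lookup
# like GetFieldByName("a") can only ever see one of the two fields.
fields = [('a', 'int64'), ('a', 'string'), ('b', 'double')]

by_name = {}
for i, (name, typ) in enumerate(fields):
    by_name[name] = i  # later duplicates overwrite earlier ones

print(by_name['a'])  # 1 -- the int64 field named 'a' is unreachable by name
```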

> [Python] Construction of arrow table from pandas DataFrame allows duplicate 
> column names
> 
>
> Key: ARROW-1908
> URL: https://issues.apache.org/jira/browse/ARROW-1908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>
> [~jorisvandenbossche]'s example here: 
> https://github.com/pandas-dev/pandas/pull/18201#issuecomment-350259248 shows 
> that a {{pyarrow.Table}} with duplicate column names can be constructed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1908) [Python] Construction of arrow table from pandas DataFrame allows duplicate column names

2017-12-09 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284965#comment-16284965
 ] 

Phillip Cloud commented on ARROW-1908:
--

I can bring this up on the mailing list and we can address it in 0.9.0.

> [Python] Construction of arrow table from pandas DataFrame allows duplicate 
> column names
> 
>
> Key: ARROW-1908
> URL: https://issues.apache.org/jira/browse/ARROW-1908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>
> [~jorisvandenbossche]'s example here: 
> https://github.com/pandas-dev/pandas/pull/18201#issuecomment-350259248 shows 
> that a {{pyarrow.Table}} with duplicate column names can be constructed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1908) [Python] Construction of arrow table from pandas DataFrame allows duplicate column names

2017-12-09 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284971#comment-16284971
 ] 

Phillip Cloud commented on ARROW-1908:
--

Yep, I have a fix coming for that.

> [Python] Construction of arrow table from pandas DataFrame allows duplicate 
> column names
> 
>
> Key: ARROW-1908
> URL: https://issues.apache.org/jira/browse/ARROW-1908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>
> [~jorisvandenbossche]'s example here: 
> https://github.com/pandas-dev/pandas/pull/18201#issuecomment-350259248 shows 
> that a {{pyarrow.Table}} with duplicate column names can be constructed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1908) [Python] Construction of arrow table from pandas DataFrame with duplicate column names crashes

2017-12-09 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-1908:
-
Summary: [Python] Construction of arrow table from pandas DataFrame with 
duplicate column names crashes  (was: [Python] Construction of arrow table from 
pandas DataFrame allows duplicate column names)

> [Python] Construction of arrow table from pandas DataFrame with duplicate 
> column names crashes
> --
>
> Key: ARROW-1908
> URL: https://issues.apache.org/jira/browse/ARROW-1908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>
> [~jorisvandenbossche]'s example here: 
> https://github.com/pandas-dev/pandas/pull/18201#issuecomment-350259248 shows 
> that a {{pyarrow.Table}} with duplicate column names can be constructed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1908) [Python] Construction of arrow table from pandas DataFrame with duplicate column names crashes

2017-12-09 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-1908:
-
Labels: pandas python  (was: )

> [Python] Construction of arrow table from pandas DataFrame with duplicate 
> column names crashes
> --
>
> Key: ARROW-1908
> URL: https://issues.apache.org/jira/browse/ARROW-1908
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pandas, python
> Fix For: 0.8.0
>
>
> [~jorisvandenbossche]'s example here: 
> https://github.com/pandas-dev/pandas/pull/18201#issuecomment-350259248 shows 
> that a {{pyarrow.Table}} with duplicate column names can be constructed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2017-12-26 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304052#comment-16304052
 ] 

Phillip Cloud commented on ARROW-1950:
--

There are a couple issues here:

1. We're mapping arrow null types to float64 because of nans, but I don't think 
that's correct since null is both the type of typeless missing values _and_ the 
type of empty things, including the value type for empty lists and maps. Really 
the {{pandas_type}} should be {{'empty'}}.
2. The C++ {{table_to_blocks}} function doesn't actually support null types
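A small sketch of the proposed metadata mapping, in plain Python: the point is that the null type would map to {{'empty'}} rather than riding along with {{'float64'}} via the NaN association. The function and type names here are illustrative, not the actual pyarrow implementation.

```python
# Hedged sketch of the proposed pandas_type mapping for Arrow logical types.
# Proposal: null -> 'empty' (previously it fell through to 'float64').
def pandas_type_for(arrow_type_name):
    mapping = {
        'null': 'empty',      # proposed behavior for typeless/empty values
        'int64': 'int64',
        'double': 'float64',
        'string': 'unicode',
    }
    return mapping.get(arrow_type_name, 'object')

print(pandas_type_for('null'))  # 'empty'
```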

> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2017-12-26 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304052#comment-16304052
 ] 

Phillip Cloud edited comment on ARROW-1950 at 12/26/17 10:59 PM:
-

There are a couple issues here:

1. We're mapping arrow null types to float64 because of nans, but I don't think 
that's correct since null is both the type of typeless missing values _and_ the 
type of empty things, including the value type for empty lists and maps. Really 
the {{pandas_type}} should be {{'empty'}}.
2. The C++ {{table_to_blocks}} function doesn't actually support null types


was (Author: cpcloud):
There are a couple issues here:

1. We're mapping arrow null types to float64 because of nans, but I don't think 
that's correct since null is both the type of typeless missing values _and_ the 
type of empty things, including the value type for empty lists and maps. Really 
the {{pandas_type}} should be {{'empty'}}.
2. The C++ table_to_blocks function doesn't actually support null types

> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2017-12-26 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1950:


Assignee: Phillip Cloud

> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1941) Table <-> DataFrame roundtrip failing

2017-12-26 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1941:


Assignee: Phillip Cloud  (was: Licht Takeuchi)

> Table <-> DataFrame roundtrip failing
> -
>
> Key: ARROW-1941
> URL: https://issues.apache.org/jira/browse/ARROW-1941
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Thomas Buhrmann
>Assignee: Phillip Cloud
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Although it is possible to create an Arrow table with a column containing 
> only empty lists (cast to a particular type, e.g. string), in a roundtrip 
> through pandas the original type is lost, it seems, and subsequently attempts 
> to convert to pandas then fail.
> To reproduce in PyArrow 0.8.0:
> {code}
> import pyarrow as pa
> # Create table with array of empty lists, forced to have type list(string)
> arrays = {
> 'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
> 'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
> }
> rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
> tbl = pa.Table.from_batches([rb])
> print("Schema 1 (correct):\n{}".format(tbl.schema))
> # First roundtrip changes schema
> df = tbl.to_pandas()
> tbl2 = pa.Table.from_pandas(df)
> print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))
> # Second roundtrip explodes
> df2 = tbl2.to_pandas()
> {code}
> This results in the following output:
> {code}
> Schema 1 (correct):
> c1: list
>   child 0, item: string
> c2: list
>   child 0, item: string
> Schema 2 (wrong):
> c1: list
>   child 0, item: string
> c2: list
>   child 0, item: null
> __index_level_0__: int64
> metadata
> 
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": 
> [{"na'
> b'me": null, "field_name": null, "pandas_type": "unicode", 
> "numpy_'
> b'type": "object", "metadata": {"encoding": "UTF-8"}}], 
> "columns":'
> b' [{"name": "c1", "field_name": "c1", "pandas_type": 
> "list[unicod'
> b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", 
> "'
> b'field_name": "c2", "pandas_type": "list[float64]", 
> "numpy_type":'
> b' "object", "metadata": null}, {"name": null, "field_name": 
> "__in'
> b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", 
> "'
> b'metadata": null}], "pandas_version": "0.21.1"}'}
> ...
> > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: 
> > null
> {code}
> I.e., the array of empty lists of strings gets converted into an array of 
> lists of type null, and in the pandas schema to lists of type float64.
> If one changes the empty lists to values of None in the creation of the 
> record batches, the roundtrip doesn't explode, but it will silently convert 
> the column to a simple column of type float (i.e. I lose the list type) in 
> pandas. This doesn't help, since other batches from the same source might 
> have non-empty lists and would end up with a different inferred schema, and 
> so can't be concatenated into a single table.
> (If this attempt at a double roundtrip seems weird, in my use case I receive 
> data from a server in RecordBatches, which I convert to pandas for 
> manipulation. I then serialize this data to disk using Arrow, and later need 
> to read it back into pandas again for further manipulation. So I need to be 
> able to go through various rounds of table->df->table->df->table etc., where 
> at any time a record batch may have columns that contain only empty lists).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1976) Handling unicode pandas columns on pq.read_table

2018-01-09 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318720#comment-16318720
 ] 

Phillip Cloud commented on ARROW-1976:
--

Note that this is Python 2 specific; you won't run into issues like this if you 
don't use Python 2.

If there's no restriction on the version of Python you need to use, please use 
Python 3.

That said, since we have to support Python 2, this is a bug.

How is it possible to read in Unicode from a CSV file without specifying an 
encoding to {{read_csv}}? Pandas must make an assumption about the encoding or 
choose a default.

I've also submitted a data issue to data.gov to request that they include the 
encoding in the metadata.

https://www.data.gov/issue/request-id/635154
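The failing call is {{str(c['name'])}}, which on Python 2 implicitly ASCII-encodes a unicode name (here one starting with a BOM). A minimal sketch of a safer fallback, which leaves text values untouched and only stringifies genuinely non-string names; the helper name is hypothetical:

```python
# Sketch of a Python-2-safe replacement for str(c['name']): pass text
# through unchanged and only call str() on non-string names (e.g. None).
# On Python 2 the isinstance check would test basestring instead of str.
def name_to_key(name):
    if isinstance(name, str):
        return name
    return str(name)

print(name_to_key(u'\ufeffUNITID'))  # the BOM-prefixed name passes through
```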

> Handling unicode pandas columns on pq.read_table
> 
>
> Key: ARROW-1976
> URL: https://issues.apache.org/jira/browse/ARROW-1976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Simbarashe Nyatsanga
>
> Unicode columns in pandas DataFrames aren't being handled correctly for some 
> datasets when reading a parquet file into a pandas DataFrame, leading to the 
> common Python ASCII encoding error.
>  
> The dataset used to get the error is here: 
> https://catalog.data.gov/dataset/college-scorecard
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('college_data.csv')
> {code}
> For verification, the DataFrame's columns are indeed unicode
> {code}
> df.columns
> > Index([u'UNITID', u'OPEID', u'OPEID6', u'INSTNM', u'CITY', u'STABBR',
>u'INSTURL', u'NPCURL', u'HCM2', u'PREDDEG',
>...
>u'RET_PTL4', u'PCTFLOAN', u'UG25ABV', u'MD_EARN_WNE_P10', u'GT_25K_P6',
>u'GRAD_DEBT_MDN_SUPP', u'GRAD_DEBT_MDN10YR_SUPP', u'RPY_3YR_RT_SUPP',
>u'C150_L4_POOLED_SUPP', u'C150_4_POOLED_SUPP'],
>   dtype='object', length=123)
> {code}
> The DataFrame can be saved into a parquet file
> {code}
> arrow_table = pa.Table.from_pandas(df)
> pq.write_table(arrow_table, 'college_data.parquet')
> {code}
> But trying to read the parquet file immediately afterwards results in the 
> following
> {code}
> df = pq.read_table('college_data.parquet').to_pandas()
> > ---
> UnicodeEncodeErrorTraceback (most recent call last)
>  in ()
> > 2 df = pq.read_table('college_data.parquet').to_pandas()
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.to_pandas 
> (/Users/travis/build/BryanCutler/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-2.7/lib.cxx:46331)()
>1041 if nthreads is None:
>1042 nthreads = cpu_count()
> -> 1043 mgr = pdcompat.table_to_blockmanager(options, self, 
> memory_pool,
>1044  nthreads)
>1045 return pd.DataFrame(mgr)
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in table_to_blockmanager(options, table, memory_pool, nthreads, categoricals)
> 539 if columns:
> 540 columns_name_dict = {
> --> 541 c.get('field_name', str(c['name'])): c['name'] for c in 
> columns
> 542 }
> 543 columns_values = [
> /Users/anaconda/envs/env/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in ((c,))
> 539 if columns:
> 540 columns_name_dict = {
> --> 541 c.get('field_name', str(c['name'])): c['name'] for c in 
> columns
> 542 }
> 543 columns_values = [
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in 
> position 0: ordinal not in range(128)
> {code}
> Looking at the stacktrace , it looks like this line, which is using str which 
> by default will try to do ascii encoding: 
> https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L541



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1973:


Assignee: Phillip Cloud

> [Python] Memory leak when converting Arrow tables with array columns to 
> Pandas dataframes.
> --
>
> Key: ARROW-1973
> URL: https://issues.apache.org/jira/browse/ARROW-1973
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> There appears to be a memory leak when using PyArrow to convert tables 
> containing array columns to Pandas DataFrames.
>  See the `test_memory_leak.py` example here: 
> https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1938:


Assignee: Phillip Cloud

> [Python] Error writing to partitioned Parquet dataset
> -
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
>Reporter: Robert Dailey
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
> Attachments: pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), 
> '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 
> 'hour'], **write_table_values)
> I've also tried write_table_values = {'chunk_size': 1} and received the 
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem but wanted to submit a ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2014) [Python] Document read_pandas method in pyarrow.parquet

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2014:


Assignee: Phillip Cloud

> [Python] Document read_pandas method in pyarrow.parquet
> ---
>
> Key: ARROW-2014
> URL: https://issues.apache.org/jira/browse/ARROW-2014
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> see discussion in https://github.com/apache/arrow/issues/1302



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2016) [Python] Fix up ASV benchmarking setup and document procedure for use

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2016:


Assignee: Phillip Cloud

> [Python] Fix up ASV benchmarking setup and document procedure for use
> -
>
> Key: ARROW-2016
> URL: https://issues.apache.org/jira/browse/ARROW-2016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> We need to start writing more microbenchmarks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1999) [Python] from_numpy_dtype returns wrong types

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1999:


Assignee: Phillip Cloud

> [Python] from_numpy_dtype returns wrong types
> -
>
> Key: ARROW-1999
> URL: https://issues.apache.org/jira/browse/ARROW-1999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Windows 10 Build 15063.850
> Python: 3.6.3
> Numpy: 1.14.0
>Reporter: Victor Jimenez
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> The following code shows multiple issues when using {{from_numpy_dtype}}:
> {code}
> import numpy as np
> import pyarrow as pa
> pa.from_numpy_dtype(np.unicode) # returns DataType(bool)
> pa.from_numpy_dtype(np.int) # returns DataType(bool)
> pa.from_numpy_dtype(np.int64) # Fails with the following message:
> #
> # ArrowNotImplementedError Traceback (most recent call last)
> #  in ()
> # > 1 pa.from_numpy_dtype(np.int64)
> # 2
> #
> # types.pxi in pyarrow.lib.from_numpy_dtype()
> #
> # error.pxi in pyarrow.lib.check_status()
> #
> # ArrowNotImplementedError: Unsupported numpy type 32760
> {code}
> Additionally, a potentially related issue is also seen when using 
> {{to_pandas_dtype}}:
> {code}
> pa.DataType.to_pandas_dtype(pa.string()) # Returns numpy.object_ 
>  # (shouldn't it be numpy.unicode?)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1998) [Python] Table.from_pandas crashes when data frame is empty

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1998:


Assignee: Phillip Cloud

> [Python] Table.from_pandas crashes when data frame is empty
> ---
>
> Key: ARROW-1998
> URL: https://issues.apache.org/jira/browse/ARROW-1998
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Windows 10 Build 15063.850
> Python: 3.6.3
> Numpy: 1.14.0
> Pandas: 0.22.0
>Reporter: Victor Jimenez
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> Loading an empty CSV file, and then attempting to create a PyArrow Table from 
> it makes the application crash. The following code should be able to 
> reproduce the issue:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> FIELDS = ['id', 'name']
> NUMPY_TYPES = {
> 'id': np.int64,
> 'name': np.unicode
> }
> PYARROW_SCHEMA = pa.schema([
> pa.field('id', pa.int64()),
> pa.field('name', pa.string())
> ])
> file = open('input.csv', 'w')
> file.close()
> df = pd.read_csv(
> 'input.csv',
> header=None,
> names=FIELDS,
> dtype=NUMPY_TYPES,
> engine='c',
> )
> pa.Table.from_pandas(df, schema=PYARROW_SCHEMA)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1974:


Assignee: Phillip Cloud

> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Minor
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1940:


Assignee: Phillip Cloud

> [Python] Extra metadata gets added after multiple conversions between 
> pd.DataFrame and pa.Table
> ---
>
> Key: ARROW-1940
> URL: https://issues.apache.org/jira/browse/ARROW-1940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Dima Ryazanov
>Assignee: Phillip Cloud
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: fail.py
>
>
> We have a unit test that verifies that loading a dataframe from a .parq file 
> and saving it back with no changes produces the same result as the original 
> file. It started failing with pyarrow 0.8.0.
> After digging into it, I discovered that after the first conversion from 
> pd.DataFrame to pa.Table, the table contains the following metadata (among 
> other things):
> {code}
> "column_indexes": [{"metadata": null, "field_name": null, "name": null, 
> "numpy_type": "object", "pandas_type": "bytes"}]
> {code}
> However, after converting it to pd.DataFrame and back into a pa.Table for the 
> second time, the metadata gets an encoding field:
> {code}
> "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, 
> "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
> {code}
> See the attached file for a test case.
> So specifically, it appears that dataframe->table->dataframe->table 
> conversion produces a different result from just dataframe->table - which I 
> think is unexpected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2030) NativeFile's Attributes are not exposed in child classes without explicit initialization

2018-01-24 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2030:


 Summary: NativeFile's Attributes are not exposed in child classes 
without explicit initialization
 Key: ARROW-2030
 URL: https://issues.apache.org/jira/browse/ARROW-2030
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud


This shows up in calling `self._assert_readable()` which tries to read from a 
property that isn't exposed in child classes.

This is one of those parts of Cython that's fairly easy to trip over since its 
{{cdef}} classes don't work like Python classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2030) NativeFile's Attributes are not exposed in child classes without explicit initialization

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2030:
-
Description: 
This shows up in calling {{self._assert_readable() }}which tries to read from a 
property that isn't exposed in child classes.

This is one of those parts of Cython that's fairly easy to trip over since its 
{{cdef}} classes don't work like Python classes.

  was:
This shows up in calling `self._assert_readable()` which tries to read from a 
property that isn't exposed in child classes.

This is one of those parts of Cython that's fairly easy to trip over since its 
{{cdef}} classes don't work like Python classes.


> NativeFile's Attributes are not exposed in child classes without explicit 
> initialization
> 
>
> Key: ARROW-2030
> URL: https://issues.apache.org/jira/browse/ARROW-2030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> This shows up in calling {{self._assert_readable() }}which tries to read from 
> a property that isn't exposed in child classes.
> This is one of those parts of Cython that's fairly easy to trip over since 
> its {{cdef}} classes don't work like Python classes.
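The Cython behavior alluded to above can be illustrated with a hypothetical sketch (not the actual pyarrow source; all names here are made up): attributes of a {{cdef}} class must be declared in the class body and marked {{public}}/{{readonly}} (or wrapped in a property) to be visible from Python, unlike ordinary Python classes where instance attributes can be added freely.

```cython
# Hypothetical sketch (not the actual pyarrow source). Unlike Python
# classes, attributes of a cdef class must be declared up front, and a
# plain `cdef` attribute is not visible from Python at all unless it is
# declared `public`/`readonly` or wrapped in a property.
cdef class NativeFile:
    cdef readonly bint is_readable   # visible (read-only) from Python
    cdef bint _own_file              # invisible from Python code

    def _assert_readable(self):
        # C-level attributes are zero-initialized, so a subclass that
        # forgets to set `is_readable` sees False here rather than
        # raising AttributeError.
        if not self.is_readable:
            raise IOError("only valid on readable files")
```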





[jira] [Updated] (ARROW-2030) NativeFile's Attributes are not exposed in child classes without explicit initialization

2018-01-24 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2030:
-
Description: 
This shows up in calling \{{self._assert_readable()}} which tries to read from 
a property that isn't exposed in child classes.

This is one of those parts of Cython that's fairly easy to trip over since its 
{{cdef}} classes don't work like Python classes.

  was:
This shows up in calling {{self._assert_readable() }}which tries to read from a 
property that isn't exposed in child classes.

This is one of those parts of Cython that's fairly easy to trip over since its 
{{cdef}} classes don't work like Python classes.


> NativeFile's Attributes are not exposed in child classes without explicit 
> initialization
> 
>
> Key: ARROW-2030
> URL: https://issues.apache.org/jira/browse/ARROW-2030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> This shows up in calling \{{self._assert_readable()}} which tries to read 
> from a property that isn't exposed in child classes.
> This is one of those parts of Cython that's fairly easy to trip over since 
> its {{cdef}} classes don't work like Python classes.





[jira] [Created] (ARROW-2037) [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'

2018-01-25 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2037:


 Summary: [Python]: Add tests for ARROW-1941 cases where pandas 
inferred type is 'empty'
 Key: ARROW-2037
 URL: https://issues.apache.org/jira/browse/ARROW-2037
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud








[jira] [Assigned] (ARROW-2084) [C++] Support newer Brotli static library names on Windows

2018-02-03 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2084:


Assignee: Phillip Cloud  (was: Uwe L. Korn)

> [C++] Support newer Brotli static library names on Windows
> --
>
> Key: ARROW-2084
> URL: https://issues.apache.org/jira/browse/ARROW-2084
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> They use {{-}} instead of {{_}} now.





[jira] [Updated] (ARROW-2084) [C++] Support newer Brotli static library names

2018-02-03 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2084:
-
Summary: [C++] Support newer Brotli static library names  (was: [C++] 
Support newer Brotli static library names on Windows)

> [C++] Support newer Brotli static library names
> ---
>
> Key: ARROW-2084
> URL: https://issues.apache.org/jira/browse/ARROW-2084
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> They use {{-}} instead of {{_}} now.





[jira] [Closed] (ARROW-2030) NativeFile's Attributes are not exposed in child classes without explicit initialization

2018-02-07 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud closed ARROW-2030.

Resolution: Invalid

Closing as not an issue.

> NativeFile's Attributes are not exposed in child classes without explicit 
> initialization
> 
>
> Key: ARROW-2030
> URL: https://issues.apache.org/jira/browse/ARROW-2030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> This shows up in calling \{{self._assert_readable()}} which tries to read 
> from a property that isn't exposed in child classes.
> This is one of those parts of Cython that's fairly easy to trip over since 
> its {{cdef}} classes don't work like Python classes.





[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.

2018-02-07 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356138#comment-16356138
 ] 

Phillip Cloud commented on ARROW-1973:
--

Working on this.

> [Python] Memory leak when converting Arrow tables with array columns to 
> Pandas dataframes.
> --
>
> Key: ARROW-1973
> URL: https://issues.apache.org/jira/browse/ARROW-1973
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> There appears to be a memory leak when using PyArrow to convert tables 
> containing array columns to Pandas DataFrames.
>  See the `test_memory_leak.py` example here: 
> https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors





[jira] [Resolved] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2018-02-07 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-1950.
--
Resolution: Fixed

Issue resolved by pull request 1571
[https://github.com/apache/arrow/pull/1571]

> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438





[jira] [Created] (ARROW-2117) [C++] Pin clang to version 5.0

2018-02-08 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2117:


 Summary: [C++] Pin clang to version 5.0
 Key: ARROW-2117
 URL: https://issues.apache.org/jira/browse/ARROW-2117
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.9.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud


Let's do this after the next release.





[jira] [Resolved] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.

2018-02-09 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-1973.
--
Resolution: Fixed

Issue resolved by pull request 1578
[https://github.com/apache/arrow/pull/1578]

> [Python] Memory leak when converting Arrow tables with array columns to 
> Pandas dataframes.
> --
>
> Key: ARROW-1973
> URL: https://issues.apache.org/jira/browse/ARROW-1973
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There appears to be a memory leak when using PyArrow to convert tables 
> containing array columns to Pandas DataFrames.
>  See the `test_memory_leak.py` example here: 
> https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors





[jira] [Updated] (ARROW-2049) ARROW-2049: [Python] Use python -m cython to run Cython, instead of CYTHON_EXECUTABLE

2018-02-09 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2049:
-
Summary: ARROW-2049: [Python] Use python -m cython to run Cython, instead 
of CYTHON_EXECUTABLE  (was: setup.py doesn't pick a cython executable that 
resides in the venv used for running setup.py)

> ARROW-2049: [Python] Use python -m cython to run Cython, instead of 
> CYTHON_EXECUTABLE
> -
>
> Key: ARROW-2049
> URL: https://issues.apache.org/jira/browse/ARROW-2049
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Moriyoshi Koizumi
>Priority: Trivial
>  Labels: pull-request-available
>
> setup.py doesn't even try to detect the right cython executable, especially 
> when it is used in a virtualenv. Instead it always tries to use the one found 
> in the search path.
>  
> I am going to create the PR on GitHub accordingly.





[jira] [Assigned] (ARROW-2049) ARROW-2049: [Python] Use python -m cython to run Cython, instead of CYTHON_EXECUTABLE

2018-02-09 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2049:


Assignee: Phillip Cloud

> ARROW-2049: [Python] Use python -m cython to run Cython, instead of 
> CYTHON_EXECUTABLE
> -
>
> Key: ARROW-2049
> URL: https://issues.apache.org/jira/browse/ARROW-2049
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Moriyoshi Koizumi
>Assignee: Phillip Cloud
>Priority: Trivial
>  Labels: pull-request-available
>
> setup.py doesn't even try to detect the right cython executable, especially 
> when it is used in a virtualenv. Instead it always tries to use the one found 
> in the search path.
>  
> I am going to create the PR on GitHub accordingly.





[jira] [Created] (ARROW-2137) [Python] Don't print paths that are ignored when reading Parquet files

2018-02-12 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2137:


 Summary: [Python] Don't print paths that are ignored when reading 
Parquet files
 Key: ARROW-2137
 URL: https://issues.apache.org/jira/browse/ARROW-2137
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.9.0








[jira] [Assigned] (ARROW-2145) decimal conversion not working for NaN values

2018-02-13 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2145:


Assignee: Phillip Cloud

> decimal conversion not working for NaN values
> -
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.
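A stdlib-only illustration of why these values are awkward targets (my reasoning, not from the ticket): Python's {{decimal}} module permits non-finite special values, but a fixed-precision wire format such as Arrow's decimal128 has no encoding for them, so a converter must either reject them or map them to nulls.

```python
from decimal import Decimal

# Python's decimal module permits special values with no fixed-point form.
specials = [Decimal('NaN'), Decimal('Infinity'), Decimal('-Infinity')]
for d in specials:
    # is_finite() is False for every special value; a fixed-precision
    # format like Arrow's decimal128 has no slot for them.
    print(d, d.is_finite())
```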





[jira] [Commented] (ARROW-2145) decimal conversion not working for NaN values

2018-02-13 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363233#comment-16363233
 ] 

Phillip Cloud commented on ARROW-2145:
--

Thanks for the report, taking a look now.

> decimal conversion not working for NaN values
> -
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.





[jira] [Commented] (ARROW-2145) decimal conversion not working for NaN values

2018-02-13 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363332#comment-16363332
 ] 

Phillip Cloud commented on ARROW-2145:
--

[~antonymayi] Do you have a specific use case for this, or were you tinkering 
around and trying a few things?

> decimal conversion not working for NaN values
> -
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.





[jira] [Commented] (ARROW-2153) decimal conversion not working for exponential notation

2018-02-14 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364319#comment-16364319
 ] 

Phillip Cloud commented on ARROW-2153:
--

If this is blocking you, you can use a decimal point as a workaround:

 
{code}
pa.array(pd.Series([Decimal('-3.0e+1')]))
{code}

I'll put up a fix today for this.
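A stdlib-only check of why the workaround helps (my illustration, not from the thread): both spellings denote the same Decimal value, but only the spelling with an explicit fractional digit round-trips to a string without an 'E', and the string form is what the parser sees.

```python
from decimal import Decimal

a = Decimal('-3e+1')    # str(a) keeps the exponent: '-3E+1'
b = Decimal('-3.0e+1')  # the fractional digit absorbs the exponent: '-30'
assert a == b                               # same value...
assert 'E' in str(a) and 'E' not in str(b)  # ...different string form
print(str(a), str(b))
```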

> decimal conversion not working for exponential notation
> ---
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Priority: Major
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in 
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> When writing values by hand we can clearly use {{decimal.Decimal('20')}} 
> instead of {{decimal.Decimal('2E+1')}}, but during arithmetic inside an 
> application the exponential notation can be produced outside our control (it 
> is actually the _normalized_ form of the decimal number). Moreover, for some 
> values the exponential notation is the only form that expresses the 
> significance, so it should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}





[jira] [Comment Edited] (ARROW-2153) decimal conversion not working for exponential notation

2018-02-14 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364319#comment-16364319
 ] 

Phillip Cloud edited comment on ARROW-2153 at 2/14/18 3:55 PM:
---

If this is blocking you, you can use a decimal point as a workaround:

{code}
pa.array(pd.Series([Decimal('-3.0e+1')]))
{code}

I'll put up a fix today for this.


was (Author: cpcloud):
If this is blocking you, you ca use a decimal point as a workaround:

 
{code}
pa.array(pd.Series([Decimal('-3.0e+1')]))
{code}

I'll put up a fix today for this.

> decimal conversion not working for exponential notation
> ---
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Priority: Major
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in 
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> When writing values by hand we can clearly use {{decimal.Decimal('20')}} 
> instead of {{decimal.Decimal('2E+1')}}, but during arithmetic inside an 
> application the exponential notation can be produced outside our control (it 
> is actually the _normalized_ form of the decimal number). Moreover, for some 
> values the exponential notation is the only form that expresses the 
> significance, so it should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}





[jira] [Assigned] (ARROW-2153) decimal conversion not working for exponential notation

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2153:


Assignee: Phillip Cloud

> decimal conversion not working for exponential notation
> ---
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in 
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> When writing values by hand we can clearly use {{decimal.Decimal('20')}} 
> instead of {{decimal.Decimal('2E+1')}}, but during arithmetic inside an 
> application the exponential notation can be produced outside our control (it 
> is actually the _normalized_ form of the decimal number). Moreover, for some 
> values the exponential notation is the only form that expresses the 
> significance, so it should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}





[jira] [Comment Edited] (ARROW-2153) decimal conversion not working for exponential notation

2018-02-14 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364319#comment-16364319
 ] 

Phillip Cloud edited comment on ARROW-2153 at 2/14/18 3:55 PM:
---

If this is blocking you, you can use a decimal point as a workaround:

{code}
pa.array(pd.Series([Decimal('-3.0e+1')]))
{code}

I'll put up a fix today for this.


was (Author: cpcloud):
If this is blocking you, you ca use a decimal point as a workaround:

{code}
pa.array(pd.Series([Decimal('-3.0e+1')]))
{code}

I'll put up a fix today for this.

> decimal conversion not working for exponential notation
> ---
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Priority: Major
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in 
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> When writing values by hand we can clearly use {{decimal.Decimal('20')}} 
> instead of {{decimal.Decimal('2E+1')}}, but during arithmetic inside an 
> application the exponential notation can be produced outside our control (it 
> is actually the _normalized_ form of the decimal number). Moreover, for some 
> values the exponential notation is the only form that expresses the 
> significance, so it should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}





[jira] [Created] (ARROW-2157) Decimal arrays cannot be constructed from Python lists

2018-02-14 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2157:


 Summary: Decimal arrays cannot be constructed from Python lists
 Key: ARROW-2157
 URL: https://issues.apache.org/jira/browse/ARROW-2157
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.9.0


{code}
In [14]: pa.array([Decimal('1')])
---
ArrowInvalid  Traceback (most recent call last)
 in ()
> 1 pa.array([Decimal('1')])

array.pxi in pyarrow.lib.array()

array.pxi in pyarrow.lib._sequence_to_array()

error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Error inferring Arrow data type for collection of Python objects. 
Got Python object of type Decimal but can only handle these types: bool, float, 
integer, date, datetime, bytes, unicode
{code}





[jira] [Commented] (ARROW-2145) decimal conversion not working for NaN values

2018-02-14 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364830#comment-16364830
 ] 

Phillip Cloud commented on ARROW-2145:
--

Both are definitely bugs, working on a fix.

> decimal conversion not working for NaN values
> -
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.
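Until a fix lands, one hypothetical workaround is to map non-finite Decimal values to nulls before conversion; {{decimal.Decimal.is_finite()}} is false for NaN, sNaN, and the infinities. The helper name below is illustrative, not part of pyarrow:

```python
from decimal import Decimal

def decimal_or_null(value):
    # Non-finite decimals (NaN, sNaN, +/-Infinity) become None, which
    # Arrow can represent as a null slot in a decimal column.
    if value is None or not value.is_finite():
        return None
    return value

column = [Decimal('1.1'), Decimal('NaN'), Decimal('Infinity')]
print([decimal_or_null(v) for v in column])  # [Decimal('1.1'), None, None]
```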



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2145:
-
Summary: [Python] Decimal conversion not working for NaN values  (was: 
decimal conversion not working for NaN values)

> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2158) [Python] Construction of Decimal array with None or np.nan fails

2018-02-14 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2158:


 Summary: [Python] Construction of Decimal array with None or 
np.nan fails
 Key: ARROW-2158
 URL: https://issues.apache.org/jira/browse/ARROW-2158
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.9.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2158) [Python] Construction of Decimal array with None or np.nan fails

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud closed ARROW-2158.

Resolution: Duplicate

This is really just ARROW-2145.

> [Python] Construction of Decimal array with None or np.nan fails
> 
>
> Key: ARROW-2158
> URL: https://issues.apache.org/jira/browse/ARROW-2158
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2160) decimal precision inference

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2160:


Assignee: Phillip Cloud

> decimal precision inference
> ---
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> It looks like Arrow infers the highest precision for a given column from the
> first cell and expects the rest of the values to fit into it. I understand this
> is by design, but from the point of view of pandas-arrow compatibility it is
> quite painful, as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) the column(s)
> # Adjust the first cell of (each of) the column(s) so it has the highest
> precision of that column(s)
> # Only then pass such a {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict
> about the highest precision for a column), shouldn't some similar logic be
> part of {{Table.from_pandas()}} directly, to make this transparent?
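The column-wide scan the reporter describes can be sketched in a few lines of Python. This is a sketch of the requested behavior, not pyarrow's implementation, and {{infer_decimal_type}} is a hypothetical name:

```python
from decimal import Decimal

def infer_decimal_type(values):
    """Return the (precision, scale) needed to hold every Decimal in
    `values`, scanning the whole column instead of only the first cell."""
    max_scale = 0
    max_integer_digits = 0
    for v in values:
        sign, digits, exponent = v.as_tuple()
        scale = max(-exponent, 0)                        # fractional digits
        integer_digits = max(len(digits) + exponent, 0)  # digits left of the point
        max_scale = max(max_scale, scale)
        max_integer_digits = max(max_integer_digits, integer_digits)
    # Precision must cover the widest integer part plus the widest scale.
    return max_integer_digits + max_scale, max_scale

print(infer_decimal_type([Decimal('0.1'), Decimal('0.01')]))  # (2, 2)
```

With this result a type such as decimal(2, 2) would hold both 0.10 and 0.01 without the first-cell mismatch shown above.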



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2160) [C++/Python] Decimal precision inference

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2160:
-
Summary: [C++/Python]  Decimal precision inference  (was: decimal precision 
inference)

> [C++/Python]  Decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> It looks like Arrow infers the highest precision for a given column from the
> first cell and expects the rest of the values to fit into it. I understand this
> is by design, but from the point of view of pandas-arrow compatibility it is
> quite painful, as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) the column(s)
> # Adjust the first cell of (each of) the column(s) so it has the highest
> precision of that column(s)
> # Only then pass such a {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict
> about the highest precision for a column), shouldn't some similar logic be
> part of {{Table.from_pandas()}} directly, to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2160) [C++/Python] Decimal precision inference

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2160:
-
Issue Type: Bug  (was: Improvement)

> [C++/Python]  Decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> It looks like Arrow infers the highest precision for a given column from the
> first cell and expects the rest of the values to fit into it. I understand this
> is by design, but from the point of view of pandas-arrow compatibility it is
> quite painful, as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) the column(s)
> # Adjust the first cell of (each of) the column(s) so it has the highest
> precision of that column(s)
> # Only then pass such a {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict
> about the highest precision for a column), shouldn't some similar logic be
> part of {{Table.from_pandas()}} directly, to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2160) [C++/Python] Decimal precision inference

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2160:
-
Fix Version/s: 0.9.0

> [C++/Python]  Decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> It looks like Arrow infers the highest precision for a given column from the
> first cell and expects the rest of the values to fit into it. I understand this
> is by design, but from the point of view of pandas-arrow compatibility it is
> quite painful, as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) the column(s)
> # Adjust the first cell of (each of) the column(s) so it has the highest
> precision of that column(s)
> # Only then pass such a {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict
> about the highest precision for a column), shouldn't some similar logic be
> part of {{Table.from_pandas()}} directly, to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2160) [C++/Python] Decimal precision inference

2018-02-14 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364934#comment-16364934
 ] 

Phillip Cloud commented on ARROW-2160:
--

Thanks for the report. I will make sure this goes into 0.9.0.

> [C++/Python]  Decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> It looks like Arrow infers the highest precision for a given column from the
> first cell and expects the rest of the values to fit into it. I understand this
> is by design, but from the point of view of pandas-arrow compatibility it is
> quite painful, as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) the column(s)
> # Adjust the first cell of (each of) the column(s) so it has the highest
> precision of that column(s)
> # Only then pass such a {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict
> about the highest precision for a column), shouldn't some similar logic be
> part of {{Table.from_pandas()}} directly, to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2161) [Python] test_cython_api failing for a build_ext --inplace install

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2161:
-
Summary: [Python] test_cython_api failing for a build_ext --inplace install 
 (was: test_cython_api failing for a build_ext --inplace install)

> [Python] test_cython_api failing for a build_ext --inplace install
> --
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Phillip Cloud
>Priority: Major
>
> {code}
> pytest pyarrow -x --tb=short 
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2161) test_cython_api failing for a build_ext --inplace install

2018-02-14 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2161:


 Summary: test_cython_api failing for a build_ext --inplace install
 Key: ARROW-2161
 URL: https://issues.apache.org/jira/browse/ARROW-2161
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Phillip Cloud


{code}
pytest pyarrow -x --tb=short 
= test session starts 
=
platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
collected 580 items

pyarrow/tests/test_array.py 
... 
[ 10%]
pyarrow/tests/test_convert_builtin.py 

  [ 24%]
pyarrow/tests/test_convert_pandas.py 
...x...s..
 [ 38%]
.   
[ 41%]
pyarrow/tests/test_cython.py F

== FAILURES 
===
___ test_cython_api 
___
pyarrow/tests/test_cython.py:88: in test_cython_api
'build_ext', '--inplace'])
/home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
check_call
raise CalledProcessError(retcode, cmd)
E   subprocess.CalledProcessError: Command 
'['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
'build_ext', '--inplace']' returned non-zero exit status 1.
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2161) [Python] test_cython_api failing for a build_ext --inplace install

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2161:
-
Affects Version/s: 0.8.0

> [Python] test_cython_api failing for a build_ext --inplace install
> --
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> {code}
> pytest pyarrow -x --tb=short 
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2161) [Python] test_cython_api failing for a build_ext --inplace install

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2161:
-
Component/s: Python

> [Python] test_cython_api failing for a build_ext --inplace install
> --
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> {code}
> pytest pyarrow -x --tb=short 
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2161) [Python] test_cython_api failing for a build_ext --inplace install

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2161:
-
Description: 
{code}
pytest pyarrow -x --tb=short
= test session starts 
=
platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
collected 580 items

pyarrow/tests/test_array.py 
... 
[ 10%]
pyarrow/tests/test_convert_builtin.py 

  [ 24%]
pyarrow/tests/test_convert_pandas.py 
...x...s..
 [ 38%]
.   
[ 41%]
pyarrow/tests/test_cython.py F

== FAILURES 
===
___ test_cython_api 
___
pyarrow/tests/test_cython.py:88: in test_cython_api
'build_ext', '--inplace'])
/home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
check_call
raise CalledProcessError(retcode, cmd)
E   subprocess.CalledProcessError: Command 
'['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
'build_ext', '--inplace']' returned non-zero exit status 1.
{code}

  was:
{code}
pytest pyarrow -x --tb=short 
= test session starts 
=
platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
collected 580 items

pyarrow/tests/test_array.py 
... 
[ 10%]
pyarrow/tests/test_convert_builtin.py 

  [ 24%]
pyarrow/tests/test_convert_pandas.py 
...x...s..
 [ 38%]
.   
[ 41%]
pyarrow/tests/test_cython.py F

== FAILURES 
===
___ test_cython_api 
___
pyarrow/tests/test_cython.py:88: in test_cython_api
'build_ext', '--inplace'])
/home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
check_call
raise CalledProcessError(retcode, cmd)
E   subprocess.CalledProcessError: Command 
'['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
'build_ext', '--inplace']' returned non-zero exit status 1.
{code}


> [Python] test_cython_api failing for a build_ext --inplace install
> --
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> {code}
> pytest pyarrow -x --tb=short
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/philli

[jira] [Assigned] (ARROW-2117) [C++] Pin clang to version 5.0

2018-02-14 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2117:


Assignee: (was: Phillip Cloud)

> [C++] Pin clang to version 5.0
> --
>
> Key: ARROW-2117
> URL: https://issues.apache.org/jira/browse/ARROW-2117
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> Let's do this after the next release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2160) [C++/Python] Decimal precision inference

2018-02-14 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364996#comment-16364996
 ] 

Phillip Cloud commented on ARROW-2160:
--

This is actually a bug, since we _do_ perform the inference if the user does 
not pass a type in. The case of all leading zeros isn't handled.

> [C++/Python]  Decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> It looks like Arrow infers the highest precision for a given column from the
> first cell and expects the rest of the values to fit into it. I understand this
> is by design, but from the point of view of pandas-arrow compatibility it is
> quite painful, as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) the column(s)
> # Adjust the first cell of (each of) the column(s) so that it explicitly
> uses the highest precision of that column(s)
> # Only then pass such a {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict
> about the highest precision for a column), shouldn't some similar logic be
> part of {{Table.from_pandas()}} directly, to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2160) [C++/Python] Decimal precision inference

2018-02-14 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365074#comment-16365074
 ] 

Phillip Cloud commented on ARROW-2160:
--

The definitions are the same, modulo possibly the case of all leading zeros, 
which is what's in question here. Does SQL have a consistent definition of 
precision with respect to leading zeros?

> [C++/Python]  Decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> It looks like Arrow infers the highest precision for a given column from the
> first cell and expects the rest of the values to fit into it. I understand this
> is by design, but from the point of view of pandas-arrow compatibility it is
> quite painful, as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) the column(s)
> # Adjust the first cell of (each of) the column(s) so that it explicitly
> uses the highest precision of that column(s)
> # Only then pass such a {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict
> about the highest precision for a column), shouldn't some similar logic be
> part of {{Table.from_pandas()}} directly, to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-14 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2162:


 Summary: [Python/C++] Decimal Values with too-high precision are 
multiplied by 100
 Key: ARROW-2162
 URL: https://issues.apache.org/jira/browse/ARROW-2162
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.9.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-15 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2162:
-
Description: 
From GitHub:

This works as expected:

{code}
>>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
Decimal('1.23')
{code}

Storing an extra digit of precision multiplies the stored value by a factor of 
100:

{code}
>>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
Decimal('123.40')
{code}

Ideally I would get an exception since the value I'm trying to store doesn't 
fit in the declared type of the array. It would be less good, but still ok, if 
the stored value were 1.23 (truncating the extra digit). I didn't expect 
pyarrow to silently store a value that differs from the original value by a 
factor of 100.

I originally thought that the code was incorrectly multiplying through by an
extra factor of 10**scale, but that doesn't seem to be the case. If I change
the scale, it always seems to be a factor of 100:

{code}
>>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
Decimal('123.450')
{code}

I see the same behavior if I use floating point to initialize the array rather
than Python's decimal type.

I searched GitHub and JIRA but didn't find any open issues related to this. I
am using pyarrow 0.8.0 on OS X 10.12.6 with Python 2.7.14 installed via
Homebrew.
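The raising behavior the reporter asks for can be illustrated with Python's decimal module: quantize the value to the target scale with the {{Inexact}} trap enabled, so a value with too many fractional digits raises instead of being silently rescaled. {{to_unscaled_int}} is a hypothetical helper for illustration, not pyarrow's conversion path:

```python
from decimal import Decimal, Context, Inexact

def to_unscaled_int(value, precision, scale):
    # Quantize to exactly `scale` fractional digits; the Inexact trap
    # turns any lossy rounding into an exception instead of silently
    # storing a wrong value.
    ctx = Context(traps=[Inexact])
    quantized = value.quantize(Decimal(1).scaleb(-scale), context=ctx)
    # The unscaled integer is what a fixed-point decimal column stores.
    unscaled = int(quantized.scaleb(scale))
    if len(str(abs(unscaled))) > precision:
        raise ValueError(f"{value} does not fit in precision {precision}")
    return unscaled

print(to_unscaled_int(Decimal('1.23'), 10, 2))  # 123
```

With this check, to_unscaled_int(Decimal('1.234'), 10, 2) raises decimal.Inexact rather than producing a value 100x too large.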

> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an
> extra factor of 10**scale, but that doesn't seem to be the case. If I change
> the scale, it always seems to be a factor of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> {code}
> I see the same behavior if I use floating point to initialize the array
> rather than Python's decimal type.
> I searched GitHub and JIRA but didn't find any open issues related to this. I
> am using pyarrow 0.8.0 on OS X 10.12.6 with Python 2.7.14 installed via
> Homebrew.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2164) [C++] Clean up unnecessary decimal module refs

2018-02-15 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2164:


 Summary: [C++] Clean up unnecessary decimal module refs
 Key: ARROW-2164
 URL: https://issues.apache.org/jira/browse/ARROW-2164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.9.0


See this comment: 
https://github.com/apache/arrow/pull/1610#discussion_r168533239





[jira] [Updated] (ARROW-2164) [C++] Clean up unnecessary decimal module refs

2018-02-15 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2164:
-
Fix Version/s: (was: 0.9.0)

> [C++] Clean up unnecessary decimal module refs
> --
>
> Key: ARROW-2164
> URL: https://issues.apache.org/jira/browse/ARROW-2164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> See this comment: 
> https://github.com/apache/arrow/pull/1610#discussion_r168533239





[jira] [Commented] (ARROW-2161) [Python] test_cython_api failing for a build_ext --inplace install

2018-02-15 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366254#comment-16366254
 ] 

Phillip Cloud commented on ARROW-2161:
--

Yep, I have scripts that automate setting envars and splitting tmux screens :)

Still fails with that set

> [Python] test_cython_api failing for a build_ext --inplace install
> --
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> {code}
> pytest pyarrow -x --tb=short
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}



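For context on the traceback quoted above: {{subprocess.check_call}} raises {{CalledProcessError}} whenever the child process exits non-zero, which is exactly how the failing {{setup.py build_ext --inplace}} invocation surfaces in the test. A minimal self-contained reproduction of that mechanism:

```python
import subprocess
import sys

# check_call raises CalledProcessError when the child exits non-zero,
# mirroring how the failing build_ext subprocess surfaces in the test.
try:
    subprocess.check_call([sys.executable, '-c', 'raise SystemExit(1)'])
except subprocess.CalledProcessError as exc:
    print(exc.returncode)  # 1
```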


[jira] [Commented] (ARROW-2161) [Python] test_cython_api failing for a build_ext --inplace install

2018-02-15 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366310#comment-16366310
 ] 

Phillip Cloud commented on ARROW-2161:
--

Ok, this seems to be working now for reasons I don't quite understand. I'll put 
up a PR to skip if ARROW_HOME isn't defined though since that's an actual 
buglet :)

> [Python] test_cython_api failing for a build_ext --inplace install
> --
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> {code}
> pytest pyarrow -x --tb=short
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}





[jira] [Updated] (ARROW-2161) [Python] Skip test_cython_api if ARROW_HOME isn't defined

2018-02-15 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2161:
-
Summary: [Python] Skip test_cython_api if ARROW_HOME isn't defined  (was: 
[Python] test_cython_api failing for a build_ext --inplace install)

> [Python] Skip test_cython_api if ARROW_HOME isn't defined
> -
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> pytest pyarrow -x --tb=short
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}





[jira] [Updated] (ARROW-2161) [Python] Skip test_cython_api if ARROW_HOME isn't defined

2018-02-15 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2161:
-
Fix Version/s: 0.9.0

> [Python] Skip test_cython_api if ARROW_HOME isn't defined
> -
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> pytest pyarrow -x --tb=short
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}





[jira] [Assigned] (ARROW-2161) [Python] Skip test_cython_api if ARROW_HOME isn't defined

2018-02-15 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2161:


Assignee: Phillip Cloud

> [Python] Skip test_cython_api if ARROW_HOME isn't defined
> -
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> pytest pyarrow -x --tb=short
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}





[jira] [Commented] (ARROW-2161) [Python] Skip test_cython_api if ARROW_HOME isn't defined

2018-02-15 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366336#comment-16366336
 ] 

Phillip Cloud commented on ARROW-2161:
--

Ah this is working because I ran `python setup.py develop` which puts `pyarrow` 
on the path so it can be imported.
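
A sketch of the skip guard the fix adds (names are hypothetical; in pyarrow's suite this would back a pytest skipif marker rather than a standalone helper):

```python
import os

def should_skip_cython_test(environ=None):
    # The Cython example needs ARROW_HOME to locate the Arrow headers and
    # libraries; without it, `setup.py build_ext` fails, so the test should
    # be skipped rather than reported as a failure.
    if environ is None:
        environ = os.environ
    return 'ARROW_HOME' not in environ

print(should_skip_cython_test({}))                            # True
print(should_skip_cython_test({'ARROW_HOME': '/opt/arrow'}))  # False
```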

> [Python] Skip test_cython_api if ARROW_HOME isn't defined
> -
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> pytest pyarrow -x --tb=short
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}





[jira] [Created] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production

2018-02-16 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2167:


 Summary: [C++] Building Orc extensions fails with the default 
BUILD_WARNING_LEVEL=Production
 Key: ARROW-2167
 URL: https://issues.apache.org/jira/browse/ARROW-2167
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.8.0
Reporter: Phillip Cloud


Building orc_ep fails because there are a bunch of upstream warnings, such as 
missing {{override}} on virtual destructors in subclasses and using {{0}} as the 
{{nullptr}} constant, and the default {{BUILD_WARNING_LEVEL}} is {{Production}}, 
which includes {{-Wall}} with warnings treated as errors.

I see that there are different possible options for {{BUILD_WARNING_LEVEL}}, so 
developers can work around this issue themselves.

It seems easier to let EPs build with whatever the default warning level is for 
the project rather than force our defaults on those projects.

Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc?





[jira] [Commented] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production

2018-02-16 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367476#comment-16367476
 ] 

Phillip Cloud commented on ARROW-2167:
--

ping [~jim.crist]

> [C++] Building Orc extensions fails with the default 
> BUILD_WARNING_LEVEL=Production
> ---
>
> Key: ARROW-2167
> URL: https://issues.apache.org/jira/browse/ARROW-2167
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> Building orc_ep fails because there are a bunch of upstream warnings, such as 
> missing {{override}} on virtual destructors in subclasses and using {{0}} as 
> the {{nullptr}} constant, and the default {{BUILD_WARNING_LEVEL}} is 
> {{Production}}, which includes {{-Wall}} with warnings treated as errors.
> I see that there are different possible options for {{BUILD_WARNING_LEVEL}} 
> so it's possible for developers to deal with this issue.
> It seems easier to let EPs build with whatever the default warning level is 
> for the project rather than force our defaults on those projects.
> Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc?





[jira] [Commented] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production

2018-02-16 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367481#comment-16367481
 ] 

Phillip Cloud commented on ARROW-2167:
--

Actually, it looks like all of the other options are _more_ strict than 
{{Production}}.

> [C++] Building Orc extensions fails with the default 
> BUILD_WARNING_LEVEL=Production
> ---
>
> Key: ARROW-2167
> URL: https://issues.apache.org/jira/browse/ARROW-2167
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> Building orc_ep fails because there are a bunch of upstream warnings, such as 
> missing {{override}} on virtual destructors in subclasses and using {{0}} as 
> the {{nullptr}} constant, and the default {{BUILD_WARNING_LEVEL}} is 
> {{Production}}, which includes {{-Wall}} with warnings treated as errors.
> I see that there are different possible options for {{BUILD_WARNING_LEVEL}} 
> so it's possible for developers to deal with this issue.
> It seems easier to let EPs build with whatever the default warning level is 
> for the project rather than force our defaults on those projects.
> Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc?





[jira] [Commented] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production

2018-02-16 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367766#comment-16367766
 ] 

Phillip Cloud commented on ARROW-2167:
--

Ok, I will take a closer look at that PR.

> [C++] Building Orc extensions fails with the default 
> BUILD_WARNING_LEVEL=Production
> ---
>
> Key: ARROW-2167
> URL: https://issues.apache.org/jira/browse/ARROW-2167
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> Building orc_ep fails because there are a bunch of upstream warnings, such as 
> missing {{override}} on virtual destructors in subclasses and using {{0}} as 
> the {{nullptr}} constant, and the default {{BUILD_WARNING_LEVEL}} is 
> {{Production}}, which includes {{-Wall}} with warnings treated as errors.
> I see that there are different possible options for {{BUILD_WARNING_LEVEL}} 
> so it's possible for developers to deal with this issue.
> It seems easier to let EPs build with whatever the default warning level is 
> for the project rather than force our defaults on those projects.
> Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc?





[jira] [Resolved] (ARROW-2161) [Python] Skip test_cython_api if ARROW_HOME isn't defined

2018-02-16 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-2161.
--
Resolution: Fixed

Issue resolved by pull request 1615
[https://github.com/apache/arrow/pull/1615]

> [Python] Skip test_cython_api if ARROW_HOME isn't defined
> -
>
> Key: ARROW-2161
> URL: https://issues.apache.org/jira/browse/ARROW-2161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code}
> pytest pyarrow -x --tb=short
> = test session starts 
> =
> platform linux -- Python 3.6.3, pytest-3.3.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/phillip/Documents/code/cpp/arrow/python, inifile: setup.cfg
> collected 580 items
> pyarrow/tests/test_array.py 
> ...   
>   [ 10%]
> pyarrow/tests/test_convert_builtin.py 
> 
>   [ 24%]
> pyarrow/tests/test_convert_pandas.py 
> ...x...s..
>  [ 38%]
> . 
>   [ 41%]
> pyarrow/tests/test_cython.py F
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> pyarrow/tests/test_cython.py:88: in test_cython_api
> 'build_ext', '--inplace'])
> /home/phillip/miniconda3/envs/pyarrow36/lib/python3.6/subprocess.py:291: in 
> check_call
> raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/phillip/miniconda3/envs/pyarrow36/bin/python', 'setup.py', 
> 'build_ext', '--inplace']' returned non-zero exit status 1.
> {code}





[jira] [Updated] (ARROW-2160) [C++/Python] Fix decimal precision inference

2018-02-16 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2160:
-
Summary: [C++/Python]  Fix decimal precision inference  (was: [C++/Python]  
Decimal precision inference)

> [C++/Python]  Fix decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> It looks like Arrow infers the highest precision for a given column from the 
> first cell and expects the rest of the values to fit. I understand this is by 
> design, but from the point of view of pandas-arrow compatibility it is quite 
> painful, as pandas is more flexible (as demonstrated).
> What this means is that user trying to pass pandas {{DataFrame}} with 
> {{Decimal}} column(s) to arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly 
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming Arrow needs to be strict 
> about the highest precision for a column), shouldn't some similar logic be 
> part of {{Table.from_pandas()}} directly to make this transparent?
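The per-column scan described in step 1 of the workaround above can be sketched with the stdlib alone (a conservative, hypothetical helper; the resulting pair would then be fed to {{pyarrow.decimal128}} and passed to {{Table.from_pandas}} as an explicit type or schema):

```python
import decimal

def infer_decimal_type(values):
    # Conservatively infer a (precision, scale) pair that can represent
    # every Decimal in the column, so the inferred type no longer depends
    # on whichever value happens to come first.
    max_scale = 0
    max_integer = 1  # keep room for at least one digit left of the point
    for v in values:
        _, digits, exponent = v.as_tuple()
        scale = max(-exponent, 0)
        max_scale = max(max_scale, scale)
        max_integer = max(max_integer, len(digits) - scale)
    return max_integer + max_scale, max_scale

print(infer_decimal_type([decimal.Decimal('0.1'), decimal.Decimal('0.01')]))  # (3, 2)
```

With the example column from the report, {{(3, 2)}} covers both 0.1 and 0.01, avoiding the "does not fit into precision inferred from first array element" error.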





[jira] [Resolved] (ARROW-2117) [C++] Pin clang to version 5.0

2018-02-16 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-2117.
--
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1597
[https://github.com/apache/arrow/pull/1597]

> [C++] Pin clang to version 5.0
> --
>
> Key: ARROW-2117
> URL: https://issues.apache.org/jira/browse/ARROW-2117
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Phillip Cloud
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Let's do this after the next release.





[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-16 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367870#comment-16367870
 ] 

Phillip Cloud commented on ARROW-2162:
--

Taking a look at this.

> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an 
> extra factor of 10**scale, but that doesn't seem to be the case. If I change 
> the scale, it always seems to be a factor of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> {code}
> I see the same behavior if I use floating point to initialize the array 
> rather than Python's decimal type.
> I searched GitHub and JIRA for open issues but didn't find anything related 
> to this. I am using pyarrow 0.8.0 on OS X 10.12.6 with Python 2.7.14 
> installed via Homebrew.





[jira] [Created] (ARROW-2169) [C++] MSVC is complaining about niter in io-file-test.cc and io-hdfs-test.cc

2018-02-17 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2169:


 Summary: [C++] MSVC is complaining about niter in io-file-test.cc 
and io-hdfs-test.cc
 Key: ARROW-2169
 URL: https://issues.apache.org/jira/browse/ARROW-2169
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.9.0


Fix up shortly.




