[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395732#comment-16395732
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

wesm commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-372427204
 
 
   see ARROW-2298 for adding an option about NaN conversions


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Reporter: Matthew Gilbert
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints containing a NaN value, the 
> NaN is improperly cast. Pandas stores such a column as floats, so when it is 
> converted to a table with an int64 schema the NaN is reinterpreted as an integer. 
> This seems like a bug, since a known limitation in pandas (the inability to hold 
> null-valued integer data) is taking precedence over Arrow's ability to store 
> these values as an IntArray with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
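
For illustration only, here is a minimal pure-Python sketch of the behaviour the report asks for: NaN slots are recorded as nulls in a separate validity list rather than being cast to an integer. This is not pyarrow code; `floats_to_nullable_ints` is a hypothetical helper introduced here, and the placeholder value 0 for masked slots is an assumption of the sketch.

```python
import math

def floats_to_nullable_ints(values):
    """Return (ints, validity); NaN slots are marked invalid, not cast."""
    ints, validity = [], []
    for v in values:
        if math.isnan(v):
            ints.append(0)           # placeholder payload, masked out below
            validity.append(False)
        else:
            ints.append(int(v))
            validity.append(True)
    return ints, validity

ints, validity = floats_to_nullable_ints([1.0, 2.0, float("nan")])
as_pylist = [i if ok else None for i, ok in zip(ints, validity)]
print(as_pylist)

# The value shown in the report is the smallest int64.
INT64_MIN = -2**63
print(INT64_MIN)
```

Keeping validity apart from the payload is what lets an integer column carry nulls without reserving any integer value as a sentinel.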


[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395731#comment-16395731
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

wesm closed pull request #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/cpp/src/arrow/python/numpy-internal.h 
b/cpp/src/arrow/python/numpy-internal.h
index 8d4308065..7672861d4 100644
--- a/cpp/src/arrow/python/numpy-internal.h
+++ b/cpp/src/arrow/python/numpy-internal.h
@@ -68,6 +68,9 @@ class Ndarray1DIndexer {
   int64_t stride_;
 };
 
+// Handling of Numpy Types by their static numbers
+// (the NPY_TYPES enum and related defines)
+
 static inline std::string GetNumPyTypeName(int npy_type) {
 #define TYPE_CASE(TYPE, NAME) \
   case NPY_##TYPE:\
@@ -79,14 +82,20 @@ static inline std::string GetNumPyTypeName(int npy_type) {
 TYPE_CASE(INT16, "int16")
 TYPE_CASE(INT32, "int32")
 TYPE_CASE(INT64, "int64")
-#if (NPY_INT64 != NPY_LONGLONG)
+#if !NPY_INT32_IS_INT
+TYPE_CASE(INT, "intc")
+#endif
+#if !NPY_INT64_IS_LONG_LONG
 TYPE_CASE(LONGLONG, "longlong")
 #endif
 TYPE_CASE(UINT8, "uint8")
 TYPE_CASE(UINT16, "uint16")
 TYPE_CASE(UINT32, "uint32")
 TYPE_CASE(UINT64, "uint64")
-#if (NPY_UINT64 != NPY_ULONGLONG)
+#if !NPY_INT32_IS_INT
+TYPE_CASE(UINT, "uintc")
+#endif
+#if !NPY_INT64_IS_LONG_LONG
 TYPE_CASE(ULONGLONG, "ulonglong")
 #endif
 TYPE_CASE(FLOAT16, "float16")
@@ -100,9 +109,48 @@ static inline std::string GetNumPyTypeName(int npy_type) {
   }
 
 #undef TYPE_CASE
-  return "unrecognized type in GetNumPyTypeName";
+  std::stringstream ss;
+  ss << "unrecognized type (" << npy_type << ") in GetNumPyTypeName";
+  return ss.str();
 }
 
+#define TYPE_VISIT_INLINE(TYPE) \
+  case NPY_##TYPE:              \
+    return visitor->template Visit<NPY_##TYPE>(arr);
+
+template <typename VISITOR>
+inline Status VisitNumpyArrayInline(PyArrayObject* arr, VISITOR* visitor) {
+  switch (PyArray_TYPE(arr)) {
+TYPE_VISIT_INLINE(BOOL);
+TYPE_VISIT_INLINE(INT8);
+TYPE_VISIT_INLINE(UINT8);
+TYPE_VISIT_INLINE(INT16);
+TYPE_VISIT_INLINE(UINT16);
+TYPE_VISIT_INLINE(INT32);
+TYPE_VISIT_INLINE(UINT32);
+TYPE_VISIT_INLINE(INT64);
+TYPE_VISIT_INLINE(UINT64);
+#if !NPY_INT32_IS_INT
+TYPE_VISIT_INLINE(INT);
+TYPE_VISIT_INLINE(UINT);
+#endif
+#if !NPY_INT64_IS_LONG_LONG
+TYPE_VISIT_INLINE(LONGLONG);
+TYPE_VISIT_INLINE(ULONGLONG);
+#endif
+TYPE_VISIT_INLINE(FLOAT16);
+TYPE_VISIT_INLINE(FLOAT32);
+TYPE_VISIT_INLINE(FLOAT64);
+TYPE_VISIT_INLINE(DATETIME);
+TYPE_VISIT_INLINE(OBJECT);
+  }
+  std::stringstream ss;
+  ss << "NumPy type not implemented: " << GetNumPyTypeName(PyArray_TYPE(arr));
+  return Status::NotImplemented(ss.str());
+}
+
+#undef TYPE_VISIT_INLINE
+
 }  // namespace py
 }  // namespace arrow
 
diff --git a/cpp/src/arrow/python/numpy_interop.h 
b/cpp/src/arrow/python/numpy_interop.h
index 8c569e232..0715c66c5 100644
--- a/cpp/src/arrow/python/numpy_interop.h
+++ b/cpp/src/arrow/python/numpy_interop.h
@@ -43,6 +43,31 @@
 #include 
 #include 
 
+// A bit subtle. Numpy has 5 canonical integer types:
+// (or, rather, type pairs: signed and unsigned)
+//   NPY_BYTE, NPY_SHORT, NPY_INT, NPY_LONG, NPY_LONGLONG
+// It also has 4 fixed-width integer aliases.
+// When mapping Arrow integer types to these 4 fixed-width aliases,
+// we always miss one of the canonical types (even though it may
+// have the same width as one of the aliases).
+// Which one depends on the platform...
+// On a LP64 system, NPY_INT64 maps to NPY_LONG and
+// NPY_LONGLONG needs to be handled separately.
+// On a LLP64 system, NPY_INT32 maps to NPY_LONG and
+// NPY_INT needs to be handled separately.
+
+#if NPY_BITSOF_LONG == 32 && NPY_BITSOF_LONGLONG == 64
+#define NPY_INT64_IS_LONG_LONG 1
+#else
+#define NPY_INT64_IS_LONG_LONG 0
+#endif
+
+#if NPY_BITSOF_INT == 32 && NPY_BITSOF_LONG == 64
+#define NPY_INT32_IS_INT 1
+#else
+#define NPY_INT32_IS_INT 0
+#endif
+
 namespace arrow {
 namespace py {
 
diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc 
b/cpp/src/arrow/python/numpy_to_arrow.cc
index 04a71c1f6..6ddc4a7be 100644
--- a/cpp/src/arrow/python/numpy_to_arrow.cc
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -84,6 +84,38 @@ inline bool PyObject_is_integer(PyObject* obj) {
   return !PyBool_Check(obj) && PyArray_IsIntegerScalar(obj);
 }
 
+Status CheckFlatNumpyArray(PyArrayObject* numpy_array, int np_type) {
+  if (PyArray_NDIM(numpy_array) != 1) {
+return Status::Invalid("only handle 1-dimensional arrays");
+  }
+
+  const int 
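
The platform dependence described in the numpy_interop.h comment of the diff above can be checked from Python. The sketch below is an illustration, not part of the patch: it uses `ctypes` to read the C integer widths and mirrors the two macros as plain booleans. The names `INT64_IS_LONG_LONG`, `INT32_IS_INT` and `extra_canonical_types` are local to this sketch.

```python
import ctypes

BITS_INT = 8 * ctypes.sizeof(ctypes.c_int)
BITS_LONG = 8 * ctypes.sizeof(ctypes.c_long)
BITS_LONGLONG = 8 * ctypes.sizeof(ctypes.c_longlong)

# Python mirrors of NPY_INT64_IS_LONG_LONG and NPY_INT32_IS_INT.
INT64_IS_LONG_LONG = BITS_LONG == 32 and BITS_LONGLONG == 64
INT32_IS_INT = BITS_INT == 32 and BITS_LONG == 64

def extra_canonical_types():
    """Canonical types that the fixed-width aliases do not reach here."""
    extra = []
    if not INT32_IS_INT:
        extra += ["intc", "uintc"]
    if not INT64_IS_LONG_LONG:
        extra += ["longlong", "ulonglong"]
    return extra

print(BITS_INT, BITS_LONG, BITS_LONGLONG, extra_canonical_types())
```

On an LP64 system this should list the long long pair, and on an LLP64 system the int pair, matching the extra cases the patch adds to the switch statements.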

[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391241#comment-16391241
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-371484306
 
 
   AppVeyor at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.175




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391179#comment-16391179
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-371474232
 
 
   Rebased.




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382750#comment-16382750
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171711960
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +145,55 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, 
uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+                        bool use_pandas_null_sentinels,
+                        std::shared_ptr* out_null_bitmap_,
+                        int64_t* out_null_count) {
+    NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+    RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+    *out_null_bitmap_ = converter.null_bitmap_;
+    *out_null_count = converter.null_count_;
+    return Status::OK();
+  }
+
+  template <int TYPE>
+  Status Visit(PyArrayObject* arr) {
+    typedef internal::npy_traits<TYPE> traits;
+
+    const bool null_sentinels_possible =
+        // Always treat Numpy's NaT as null
+        TYPE == NPY_DATETIME ||
 
 Review comment:
   AFAIU, there's no way to interpret `NaT` other than as `NULL` (unless 
there's a standard that defines it as something other than "missing"). NaN is 
part of the IEEE 754 floating-point specification (as I'm sure you know), and it 
has a different meaning than null.
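
The distinction drawn in this comment can be sketched in plain Python. This is an illustration under simple assumptions: nulls are modelled as a separate validity list, which is a stand-in for the sketch and not the converter's actual data structure.

```python
import math

nan = float("nan")

# NaN is an ordinary IEEE 754 value: it propagates through arithmetic
# and compares unequal to everything, including itself.
nan_is_self_equal = (nan == nan)
nan_propagates = math.isnan(nan + 1.0)

# A null is the absence of a value. Here it is tracked out-of-band,
# so a slot can hold NaN and still be valid.
values = [1.5, nan, 0.0]
validity = [True, True, False]   # only the last slot is null

null_count = validity.count(False)
nan_count = sum(1 for v, ok in zip(values, validity) if ok and math.isnan(v))
print(null_count, nan_count)
```

With this model a float column can contain both a valid NaN and a null, which is why treating every NaN as null has to be an opt-in choice rather than the only behaviour.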




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382735#comment-16382735
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171710346
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -501,6 +501,14 @@ def test_float_nulls(self):
         result = table.to_pandas()
         tm.assert_frame_equal(result, ex_frame)
 
+    def test_float_nulls_to_ints(self):
+        # ARROW-2135
+        df = pd.DataFrame({"a": [1.0, 2.0, pd.np.NaN]})
+        schema = pa.schema([pa.field("a", pa.int16(), nullable=True)])
+        table = pa.Table.from_pandas(df, schema=schema)
+        assert table[0].to_pylist() == [1, 2, None]
+        tm.assert_frame_equal(df, table.to_pandas())
 
 Review comment:
   That's fine. Was just wondering.




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382734#comment-16382734
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171710263
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -501,6 +501,14 @@ def test_float_nulls(self):
         result = table.to_pandas()
         tm.assert_frame_equal(result, ex_frame)
 
+    def test_float_nulls_to_ints(self):
+        # ARROW-2135
+        df = pd.DataFrame({"a": [1.0, 2.0, pd.np.NaN]})
+        schema = pa.schema([pa.field("a", pa.int16(), nullable=True)])
+        table = pa.Table.from_pandas(df, schema=schema)
+        assert table[0].to_pylist() == [1, 2, None]
+        tm.assert_frame_equal(df, table.to_pandas())
 
 Review comment:
   It looks like it's a hard cast:
   
   ```
   In [7]: pa.array([1, 2, 3.190, np.nan], type=pa.int64())
   Out[6]:
   
   [
     1,
     2,
     3,
     NA
   ]
   ```
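
As a rough illustration of what "hard cast" means here, the sketch below contrasts a truncating conversion with a stricter one that rejects fractional values. Both helpers (`hard_cast`, `checked_cast`) are hypothetical names for this sketch and are not pyarrow APIs.

```python
import math

def hard_cast(values):
    """Truncate toward zero; NaN becomes None (null)."""
    return [None if math.isnan(v) else int(v) for v in values]

def checked_cast(values):
    """Like hard_cast, but refuse values that have a fractional part."""
    out = []
    for v in values:
        if math.isnan(v):
            out.append(None)
        elif v != int(v):
            raise ValueError(f"{v!r} is not integral")
        else:
            out.append(int(v))
    return out

print(hard_cast([1.0, 2.0, 3.190, float("nan")]))
```

The hard cast silently drops the `.190`, as in the session above; a checked variant would raise on that input instead.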




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382205#comment-16382205
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-369636633
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.157




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381816#comment-16381816
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-369552237
 
 
   I addressed some review comments now.




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381773#comment-16381773
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171509916
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +145,55 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, 
uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+                        bool use_pandas_null_sentinels,
+                        std::shared_ptr* out_null_bitmap_,
+                        int64_t* out_null_count) {
+    NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+    RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+    *out_null_bitmap_ = converter.null_bitmap_;
+    *out_null_count = converter.null_count_;
+    return Status::OK();
+  }
+
+  template <int TYPE>
+  Status Visit(PyArrayObject* arr) {
+    typedef internal::npy_traits<TYPE> traits;
+
+    const bool null_sentinels_possible =
+        // Always treat Numpy's NaT as null
+        TYPE == NPY_DATETIME ||
 
 Review comment:
   By the way, I don't know what that is, but this is required to have the 
tests pass. Why do we always treat NaT as null but not floating-point NaN? 
@wesm 




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381742#comment-16381742
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171503669
 
 

 ##
 File path: cpp/src/arrow/python/type_traits.h
 ##
 @@ -127,8 +134,14 @@ template <>
 struct npy_traits {
   typedef PyObject* value_type;
   static constexpr bool supports_nulls = true;
+
+  static inline bool isnull(PyObject* v) { return v != Py_None; }
 
 Review comment:
   Nice catch :-) I'm not sure how to test it. Defining `isnull` is necessary 
for compiling, but that path isn't taken at runtime as object arrays are 
handled separately.
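
The comparison in the hunk above (`v != Py_None`) appears to be inverted for a function named `isnull`, which is presumably what the "catch" refers to; the earlier review remark is not part of this archive, so that reading is an assumption. A small Python analogue shows the kind of direct check that would expose it even though the runtime path is never taken. All names below are hypothetical.

```python
def isnull_as_posted(v):
    # Mirrors `return v != Py_None;` from the hunk under review.
    return v is not None

def isnull_expected(v):
    # What a null check on object values is presumably meant to do.
    return v is None

def check_isnull(pred):
    """True only if pred flags None and nothing else among the samples."""
    return pred(None) and not pred("x") and not pred(0)

print(check_isnull(isnull_as_posted), check_isnull(isnull_expected))
```

Calling the predicate directly, rather than going through the array conversion, is one way to cover code that is compiled but not reached at runtime.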




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381228#comment-16381228
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171420485
 
 

 ##
 File path: cpp/src/arrow/python/type_traits.h
 ##
 @@ -127,8 +134,14 @@ template <>
 struct npy_traits {
   typedef PyObject* value_type;
   static constexpr bool supports_nulls = true;
+
+  static inline bool isnull(PyObject* v) { return v != Py_None; }
 
 Review comment:
   Probably needs a test as well since it isn't failing.




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381087#comment-16381087
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171398041
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +132,66 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, 
uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+bool use_pandas_null_sentinels,
+std::shared_ptr* out_null_bitmap_,
+int64_t* out_null_count) {
+NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+*out_null_bitmap_ = converter.null_bitmap_;
+*out_null_count = converter.null_count_;
+return Status::OK();
+  }
+
+  template 
+  Status Visit(PyArrayObject* arr) {
+typedef internal::npy_traits traits;
+
+const bool null_sentinels_possible =
+// Observing pandas's null sentinels
+(use_pandas_null_sentinels_ && traits::supports_nulls);
+
+if (null_sentinels_possible) {
+  RETURN_NOT_OK(InitNullBitmap(PyArray_SIZE(arr)));
+  null_count_ = ValuesToBitmap(arr, null_bitmap_data_);
+}
+return Status::OK();
+  }
+
+  // XXX it's the same as NumPyConverter::InitNullBitmap()
+  Status InitNullBitmap(int64_t length) {
+int64_t null_bytes = BitUtil::BytesForBits(length);
+
+null_bitmap_ = std::make_shared(pool_);
+RETURN_NOT_OK(null_bitmap_->Resize(null_bytes));
+
+null_bitmap_data_ = null_bitmap_->mutable_data();
+memset(null_bitmap_data_, 0, static_cast<size_t>(null_bytes));
 
 Review comment:
   Possibly time for a subclass then?
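
For readers following the hunk above, the null-bitmap conversion it describes can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the code in the pull request: the function name `values_to_validity_bitmap` is hypothetical, NaN is treated as the only null sentinel, and the bitmap follows Arrow's validity convention (a set bit means "valid", least-significant bit first within each byte).

```python
import math


def values_to_validity_bitmap(values):
    """Return (bitmap, null_count) for a sequence of floats.

    Hypothetical helper for illustration. Bit i (LSB-first within each
    byte) is set when values[i] is valid, i.e. not NaN.
    """
    n_bytes = (len(values) + 7) // 8  # bytes needed to hold one bit per value
    bitmap = bytearray(n_bytes)       # zero-filled: every slot starts as null
    null_count = 0
    for i, v in enumerate(values):
        if math.isnan(v):
            null_count += 1           # leave the bit cleared
        else:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap), null_count


bitmap, null_count = values_to_validity_bitmap([1.0, 2.0, float("nan")])
# bitmap == b"\x03", null_count == 1
```

The zero-filled `bytearray` plays the role of the `memset` call in `InitNullBitmap`: every slot is null until a valid value sets its bit.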




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381079#comment-16381079
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171397159
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +132,66 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, 
uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+bool use_pandas_null_sentinels,
+std::shared_ptr* out_null_bitmap_,
+int64_t* out_null_count) {
+NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+*out_null_bitmap_ = converter.null_bitmap_;
+*out_null_count = converter.null_count_;
+return Status::OK();
+  }
+
+  template 
+  Status Visit(PyArrayObject* arr) {
+typedef internal::npy_traits traits;
+
+const bool null_sentinels_possible =
+// Observing pandas's null sentinels
+(use_pandas_null_sentinels_ && traits::supports_nulls);
+
+if (null_sentinels_possible) {
+  RETURN_NOT_OK(InitNullBitmap(PyArray_SIZE(arr)));
+  null_count_ = ValuesToBitmap(arr, null_bitmap_data_);
+}
+return Status::OK();
+  }
+
+  // XXX it's the same as NumPyConverter::InitNullBitmap()
+  Status InitNullBitmap(int64_t length) {
+int64_t null_bytes = BitUtil::BytesForBits(length);
+
+null_bitmap_ = std::make_shared(pool_);
+RETURN_NOT_OK(null_bitmap_->Resize(null_bytes));
+
+null_bitmap_data_ = null_bitmap_->mutable_data();
+memset(null_bitmap_data_, 0, static_cast<size_t>(null_bytes));
+
+return Status::OK();
+  }
+
+ protected:
+  NumPyNullsConverter(MemoryPool* pool, PyArrayObject* arr,
+  bool use_pandas_null_sentinels)
+  : pool_(pool),
+arr_(arr),
+use_pandas_null_sentinels_(use_pandas_null_sentinels),
+null_bitmap_data_(nullptr),
+null_count_(0) {}
+
+  MemoryPool* pool_;
+  PyArrayObject* arr_;
+  bool use_pandas_null_sentinels_;
+  std::shared_ptr null_bitmap_;
+  uint8_t* null_bitmap_data_;
 
 Review comment:
   That's one, though I added `begin()`/`end()` for that in #1651.




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381081#comment-16381081
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171397391
 
 

 ##
 File path: cpp/src/arrow/python/type_traits.h
 ##
 @@ -127,8 +134,14 @@ template <>
 struct npy_traits {
   typedef PyObject* value_type;
   static constexpr bool supports_nulls = true;
+
+  static inline bool isnull(PyObject* v) { return v != Py_None; }
 
 Review comment:
   Hm, so that's also called `isnull`. Shouldn't that mean `v == Py_None`?




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381023#comment-16381023
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171379107
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +132,66 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, 
uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+bool use_pandas_null_sentinels,
+std::shared_ptr* out_null_bitmap_,
+int64_t* out_null_count) {
+NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+*out_null_bitmap_ = converter.null_bitmap_;
+*out_null_count = converter.null_count_;
+return Status::OK();
+  }
+
+  template 
+  Status Visit(PyArrayObject* arr) {
+typedef internal::npy_traits traits;
+
+const bool null_sentinels_possible =
+// Observing pandas's null sentinels
+(use_pandas_null_sentinels_ && traits::supports_nulls);
+
+if (null_sentinels_possible) {
+  RETURN_NOT_OK(InitNullBitmap(PyArray_SIZE(arr)));
+  null_count_ = ValuesToBitmap(arr, null_bitmap_data_);
+}
+return Status::OK();
+  }
+
+  // XXX it's the same as NumPyConverter::InitNullBitmap()
+  Status InitNullBitmap(int64_t length) {
+int64_t null_bytes = BitUtil::BytesForBits(length);
+
+null_bitmap_ = std::make_shared(pool_);
+RETURN_NOT_OK(null_bitmap_->Resize(null_bytes));
+
+null_bitmap_data_ = null_bitmap_->mutable_data();
+memset(null_bitmap_data_, 0, static_cast<size_t>(null_bytes));
+
+return Status::OK();
+  }
+
+ protected:
+  NumPyNullsConverter(MemoryPool* pool, PyArrayObject* arr,
+  bool use_pandas_null_sentinels)
+  : pool_(pool),
+arr_(arr),
+use_pandas_null_sentinels_(use_pandas_null_sentinels),
+null_bitmap_data_(nullptr),
+null_count_(0) {}
+
+  MemoryPool* pool_;
+  PyArrayObject* arr_;
+  bool use_pandas_null_sentinels_;
+  std::shared_ptr null_bitmap_;
+  uint8_t* null_bitmap_data_;
 
 Review comment:
   Which iterators are you thinking about? Do you mean the ndarray 1d iterator?




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381004#comment-16381004
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171375887
 
 

 ##
 File path: cpp/src/arrow/python/type_traits.h
 ##
 @@ -127,8 +134,14 @@ template <>
 struct npy_traits {
   typedef PyObject* value_type;
   static constexpr bool supports_nulls = true;
+
+  static inline bool isnull(PyObject* v) { return v != Py_None; }
 
 Review comment:
   I see. This is really using the same convention as the rest of the file, 
though.




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380998#comment-16380998
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171375361
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +132,66 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, 
uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+bool use_pandas_null_sentinels,
+std::shared_ptr* out_null_bitmap_,
+int64_t* out_null_count) {
+NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+*out_null_bitmap_ = converter.null_bitmap_;
+*out_null_count = converter.null_count_;
+return Status::OK();
+  }
+
+  template 
+  Status Visit(PyArrayObject* arr) {
+typedef internal::npy_traits traits;
+
+const bool null_sentinels_possible =
+// Observing pandas's null sentinels
+(use_pandas_null_sentinels_ && traits::supports_nulls);
+
+if (null_sentinels_possible) {
+  RETURN_NOT_OK(InitNullBitmap(PyArray_SIZE(arr)));
+  null_count_ = ValuesToBitmap(arr, null_bitmap_data_);
+}
+return Status::OK();
+  }
+
+  // XXX it's the same as NumPyConverter::InitNullBitmap()
+  Status InitNullBitmap(int64_t length) {
+int64_t null_bytes = BitUtil::BytesForBits(length);
+
+null_bitmap_ = std::make_shared(pool_);
+RETURN_NOT_OK(null_bitmap_->Resize(null_bytes));
+
+null_bitmap_data_ = null_bitmap_->mutable_data();
+memset(null_bitmap_data_, 0, static_cast<size_t>(null_bytes));
+
+return Status::OK();
+  }
+
+ protected:
+  NumPyNullsConverter(MemoryPool* pool, PyArrayObject* arr,
+  bool use_pandas_null_sentinels)
+  : pool_(pool),
+arr_(arr),
+use_pandas_null_sentinels_(use_pandas_null_sentinels),
+null_bitmap_data_(nullptr),
+null_count_(0) {}
+
+  MemoryPool* pool_;
+  PyArrayObject* arr_;
+  bool use_pandas_null_sentinels_;
+  std::shared_ptr null_bitmap_;
+  uint8_t* null_bitmap_data_;
 
 Review comment:
   At some point we may want to have an STL-compatible view class that makes 
interacting with iterator-based constructs in the STL much easier. We have a lot 
of code that manually handles iteration using a size/count and a buffer.




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380997#comment-16380997
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171375029
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -501,6 +501,14 @@ def test_float_nulls(self):
 result = table.to_pandas()
 tm.assert_frame_equal(result, ex_frame)
 
+def test_float_nulls_to_ints(self):
+# ARROW-2135
+df = pd.DataFrame({"a": [1.0, 2.0, pd.np.NaN]})
+schema = pa.schema([pa.field("a", pa.int16(), nullable=True)])
+table = pa.Table.from_pandas(df, schema=schema)
+assert table[0].to_pylist() == [1, 2, None]
+tm.assert_frame_equal(df, table.to_pandas())
 
 Review comment:
   No, I don't think so. I'm not sure we specify the truncation mode anywhere 
either?




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380990#comment-16380990
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171374360
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -501,6 +501,14 @@ def test_float_nulls(self):
 result = table.to_pandas()
 tm.assert_frame_equal(result, ex_frame)
 
+def test_float_nulls_to_ints(self):
+# ARROW-2135
+df = pd.DataFrame({"a": [1.0, 2.0, pd.np.NaN]})
+schema = pa.schema([pa.field("a", pa.int16(), nullable=True)])
+table = pa.Table.from_pandas(df, schema=schema)
+assert table[0].to_pylist() == [1, 2, None]
+tm.assert_frame_equal(df, table.to_pandas())
 
 Review comment:
   Is there already a test for things like `a = [1.0, 2.0, 3.1, np.nan]` where 
a user passes in an integer type?
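
To make the question concrete, here is a minimal pure-Python sketch of one possible policy for such input: NaN becomes a null, and any value that cannot be represented exactly as an integer is rejected rather than truncated. This is an assumption made for illustration, not the behaviour of pyarrow or of this pull request, and the helper name `floats_to_nullable_ints` is hypothetical.

```python
import math


def floats_to_nullable_ints(values):
    """Hypothetical helper: NaN becomes None, lossy casts are rejected."""
    result = []
    for v in values:
        if math.isnan(v):
            result.append(None)          # null sentinel -> null slot
        elif not float(v).is_integer():  # also rejects +/-inf
            raise ValueError(f"cannot cast {v!r} to an integer without loss")
        else:
            result.append(int(v))
    return result


print(floats_to_nullable_ints([1.0, 2.0, float("nan")]))  # [1, 2, None]
```

Under this policy the input `[1.0, 2.0, 3.1, nan]` raises `ValueError` because of `3.1`. A truncating policy would instead yield `[1, 2, 3, None]`; the choice between the two is what a "truncation mode" would have to specify.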




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380986#comment-16380986
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171373762
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +132,66 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, 
uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+bool use_pandas_null_sentinels,
+std::shared_ptr* out_null_bitmap_,
+int64_t* out_null_count) {
+NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+*out_null_bitmap_ = converter.null_bitmap_;
+*out_null_count = converter.null_count_;
+return Status::OK();
+  }
+
+  template 
+  Status Visit(PyArrayObject* arr) {
+typedef internal::npy_traits traits;
+
+const bool null_sentinels_possible =
+// Observing pandas's null sentinels
+(use_pandas_null_sentinels_ && traits::supports_nulls);
+
+if (null_sentinels_possible) {
+  RETURN_NOT_OK(InitNullBitmap(PyArray_SIZE(arr)));
+  null_count_ = ValuesToBitmap(arr, null_bitmap_data_);
+}
+return Status::OK();
+  }
+
+  // XXX it's the same as NumPyConverter::InitNullBitmap()
+  Status InitNullBitmap(int64_t length) {
+int64_t null_bytes = BitUtil::BytesForBits(length);
+
+null_bitmap_ = std::make_shared(pool_);
+RETURN_NOT_OK(null_bitmap_->Resize(null_bytes));
+
+null_bitmap_data_ = null_bitmap_->mutable_data();
+memset(null_bitmap_data_, 0, static_cast<size_t>(null_bytes));
 
 Review comment:
   `std::fill(null_bitmap_data_, null_bitmap_data_ + null_bytes, 0)` is a bit 
more idiomatic.




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380877#comment-16380877
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou opened a new pull request #1681: ARROW-2135: [Python] Fix NaN conversion 
when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681
 
 
   




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380876#comment-16380876
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou closed pull request #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/numpy-internal.h b/cpp/src/arrow/python/numpy-internal.h
index 6c9c871a1..eee4fa46d 100644
--- a/cpp/src/arrow/python/numpy-internal.h
+++ b/cpp/src/arrow/python/numpy-internal.h
@@ -65,6 +65,9 @@ class Ndarray1DIndexer {
   int64_t stride_;
 };
 
+// Handling of Numpy Types by their static numbers
+// (the NPY_TYPES enum and related defines)
+
 static inline std::string GetNumPyTypeName(int npy_type) {
 #define TYPE_CASE(TYPE, NAME) \
   case NPY_##TYPE:\
@@ -76,14 +79,20 @@ static inline std::string GetNumPyTypeName(int npy_type) {
 TYPE_CASE(INT16, "int16")
 TYPE_CASE(INT32, "int32")
 TYPE_CASE(INT64, "int64")
-#if (NPY_INT64 != NPY_LONGLONG)
+#if !NPY_INT32_IS_INT
+TYPE_CASE(INT, "intc")
+#endif
+#if !NPY_INT64_IS_LONG_LONG
 TYPE_CASE(LONGLONG, "longlong")
 #endif
 TYPE_CASE(UINT8, "uint8")
 TYPE_CASE(UINT16, "uint16")
 TYPE_CASE(UINT32, "uint32")
 TYPE_CASE(UINT64, "uint64")
-#if (NPY_UINT64 != NPY_ULONGLONG)
+#if !NPY_INT32_IS_INT
+TYPE_CASE(UINT, "uintc")
+#endif
+#if !NPY_INT64_IS_LONG_LONG
 TYPE_CASE(ULONGLONG, "ulonglong")
 #endif
 TYPE_CASE(FLOAT16, "float16")
@@ -97,9 +106,48 @@ static inline std::string GetNumPyTypeName(int npy_type) {
   }
 
 #undef TYPE_CASE
-  return "unrecognized type in GetNumPyTypeName";
+  std::stringstream ss;
+  ss << "unrecognized type (" << npy_type << ") in GetNumPyTypeName";
+  return ss.str();
 }
 
+#define TYPE_VISIT_INLINE(TYPE) \
+  case NPY_##TYPE:  \
+return visitor->template Visit<NPY_##TYPE>(arr);
+
+template <typename VISITOR>
+inline Status VisitNumpyArrayInline(PyArrayObject* arr, VISITOR* visitor) {
+  switch (PyArray_TYPE(arr)) {
+TYPE_VISIT_INLINE(BOOL);
+TYPE_VISIT_INLINE(INT8);
+TYPE_VISIT_INLINE(UINT8);
+TYPE_VISIT_INLINE(INT16);
+TYPE_VISIT_INLINE(UINT16);
+TYPE_VISIT_INLINE(INT32);
+TYPE_VISIT_INLINE(UINT32);
+TYPE_VISIT_INLINE(INT64);
+TYPE_VISIT_INLINE(UINT64);
+#if !NPY_INT32_IS_INT
+TYPE_VISIT_INLINE(INT);
+TYPE_VISIT_INLINE(UINT);
+#endif
+#if !NPY_INT64_IS_LONG_LONG
+TYPE_VISIT_INLINE(LONGLONG);
+TYPE_VISIT_INLINE(ULONGLONG);
+#endif
+TYPE_VISIT_INLINE(FLOAT16);
+TYPE_VISIT_INLINE(FLOAT32);
+TYPE_VISIT_INLINE(FLOAT64);
+TYPE_VISIT_INLINE(DATETIME);
+TYPE_VISIT_INLINE(OBJECT);
+  }
+  std::stringstream ss;
+  ss << "NumPy type not implemented: " << GetNumPyTypeName(PyArray_TYPE(arr));
+  return Status::NotImplemented(ss.str());
+}
+
+#undef TYPE_VISIT_INLINE
+
 }  // namespace py
 }  // namespace arrow
 
diff --git a/cpp/src/arrow/python/numpy_interop.h 
b/cpp/src/arrow/python/numpy_interop.h
index 8c569e232..3531263a6 100644
--- a/cpp/src/arrow/python/numpy_interop.h
+++ b/cpp/src/arrow/python/numpy_interop.h
@@ -43,6 +43,31 @@
 #include 
 #include 
 
+// A bit subtle.  Numpy has 5 canonical integer types:
+// (or, rather, type pairs: signed and unsigned)
+//   NPY_BYTE, NPY_SHORT, NPY_INT, NPY_LONG, NPY_LONGLONG
+// It also has 4 fixed-width integer aliases.
+// When mapping Arrow integer types to these 4 fixed-width aliases,
+// we always miss one of the canonical types (even though it may
+// have the same width as one of the aliases).
+// Which one depends on the platform...
+// On a LP64 system, NPY_INT64 maps to NPY_LONG and
+// NPY_LONGLONG needs to be handled separately.
+// On a LLP64 system, NPY_INT32 maps to NPY_LONG and
+// NPY_INT needs to be handled separately.
+
+#if NPY_BITSOF_LONG == 32 && NPY_BITSOF_LONGLONG == 64
+#define NPY_INT64_IS_LONG_LONG 1
+#else
+#define NPY_INT64_IS_LONG_LONG 0
+#endif
+
+#if NPY_BITSOF_INT == 32 && NPY_BITSOF_LONG == 64
+#define NPY_INT32_IS_INT 1
+#else
+#define NPY_INT32_IS_INT 0
+#endif
+
 namespace arrow {
 namespace py {
 
diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc 
b/cpp/src/arrow/python/numpy_to_arrow.cc
index 23418ad92..c474fc383 100644
--- a/cpp/src/arrow/python/numpy_to_arrow.cc
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -94,6 +94,25 @@ inline bool PyObject_is_integer(PyObject* obj) {
   return (!PyBool_Check(obj)) && PyArray_IsIntegerScalar(obj);
 }
 
+Status CheckFlatNumpyArray(PyArrayObject* numpy_array, int np_type) {
+  if (PyArray_NDIM(numpy_array) != 1) {
+return Status::Invalid("only handle 1-dimensional arrays");
+  }
+
+  const int 
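
For readers skimming the numpy-internal.h hunk above: `TYPE_VISIT_INLINE` and `VisitNumpyArrayInline` turn a runtime type number into a compile-time template argument through a macro-generated `switch`. The following is only a minimal sketch of that pattern, not the Arrow code. Every `Demo`/`DEMO_` name is hypothetical, a `std::string` stands in for `Status`, and the idea that the visitor is templated on the type number is an assumption.

```cpp
#include <string>

// Hypothetical type numbers standing in for the NPY_TYPES enum.
enum DemoType { DEMO_INT32 = 0, DEMO_INT64 = 1, DEMO_FLOAT64 = 2, DEMO_OBJECT = 3 };

// Hypothetical array handle: it only carries its runtime type number.
struct DemoArray {
  int type_num;
};

// One switch case per supported type; each case forwards to the visitor
// with the type number as a compile-time template argument.
#define DEMO_VISIT_INLINE(TYPE) \
  case DEMO_##TYPE:             \
    return visitor->template Visit<DEMO_##TYPE>(arr);

// Returns an empty string on success, or an error message for type
// numbers that have no case (DEMO_OBJECT is left out on purpose).
template <typename VISITOR>
inline std::string VisitDemoArrayInline(const DemoArray& arr, VISITOR* visitor) {
  switch (arr.type_num) {
    DEMO_VISIT_INLINE(INT32);
    DEMO_VISIT_INLINE(INT64);
    DEMO_VISIT_INLINE(FLOAT64);
  }
  return "type not implemented: " + std::to_string(arr.type_num);
}

#undef DEMO_VISIT_INLINE

// Example visitor: records the element width chosen at compile time.
struct WidthVisitor {
  int width = 0;

  template <int TYPE>
  std::string Visit(const DemoArray&) {
    width = (TYPE == DEMO_INT32) ? 4 : 8;
    return "";
  }
};
```

Calling `VisitDemoArrayInline(DemoArray{DEMO_INT32}, &visitor)` with a `WidthVisitor` sets `width` to 4, while an array tagged `DEMO_OBJECT` falls through the switch and yields the "not implemented" message.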

[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380754#comment-16380754
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171330518
 
 

 ##
 File path: cpp/src/arrow/python/numpy-internal.h
 ##
 @@ -76,16 +76,12 @@ static inline std::string GetNumPyTypeName(int npy_type) {
 TYPE_CASE(INT16, "int16")
 TYPE_CASE(INT32, "int32")
 TYPE_CASE(INT64, "int64")
-#if (NPY_INT64 != NPY_LONGLONG)
 
 Review comment:
   For some reason (macro expansion?) these `#if`s wouldn't work correctly 
here, even though `NPY_INT64` is defined to `NPY_LONG`.
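
A likely explanation, offered as a sketch rather than a diagnosis of the NumPy headers: the type numbers are enumerators (the diff itself refers to "the NPY_TYPES enum"), and the preprocessor replaces any identifier that is not a macro with 0 inside `#if`. The snippet below reproduces the effect with hypothetical `DEMO_` names and made-up values; it does not include any NumPy header.

```cpp
// Hypothetical stand-ins for NumPy's type numbers: enumerators, not macros.
enum DemoTypes { DEMO_LONG = 7, DEMO_LONGLONG = 9 };

// A macro alias, in the same spirit as NPY_INT64 being defined to NPY_LONG.
#define DEMO_INT64 DEMO_LONG

// The preprocessor expands DEMO_INT64 to DEMO_LONG, but it knows nothing
// about enumerators: inside #if, every remaining identifier becomes 0.
// The condition is therefore evaluated as (0 != 0) and the branch is skipped.
#if (DEMO_INT64 != DEMO_LONGLONG)
#define DEMO_BRANCH_TAKEN 1
#else
#define DEMO_BRANCH_TAKEN 0
#endif

// What the preprocessor concluded about the two type numbers.
inline bool PreprocessorSeesDifference() { return DEMO_BRANCH_TAKEN == 1; }

// What the compiler, which does see the enumerator values, concludes.
inline bool CompilerSeesDifference() { return DEMO_INT64 != DEMO_LONGLONG; }
```

Here `PreprocessorSeesDifference()` returns false although `CompilerSeesDifference()` returns true, which matches the symptom described in the review comment.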




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380740#comment-16380740
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171328251
 
 

 ##
 File path: cpp/src/arrow/python/numpy-internal.h
 ##
 @@ -76,16 +76,12 @@ static inline std::string GetNumPyTypeName(int npy_type) {
 TYPE_CASE(INT16, "int16")
 TYPE_CASE(INT32, "int32")
 TYPE_CASE(INT64, "int64")
-#if (NPY_INT64 != NPY_LONGLONG)
 
 Review comment:
   Note those inequalities wouldn't have the expected effect because of how 
macro expansion works (and I don't know how to fix that :-().
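
One way around this, sketched here with standard `<climits>` macros instead of NumPy's own, is to compare quantities that really are preprocessor constants. The merged diff for this pull request takes a similar route with `NPY_BITSOF_LONG` and `NPY_BITSOF_LONGLONG`. The `DEMO_` macro and the two helper functions are hypothetical names used only for illustration.

```cpp
#include <climits>

// Unlike enumerators, the <climits> limits are macros that expand to
// integer constant expressions usable in #if, so they can be compared here.
#if LONG_MAX == LLONG_MAX
// long and long long have the same range (typical of LP64 platforms).
#define DEMO_LONG_MATCHES_LONG_LONG 1
#else
// long is narrower than long long (typical of LLP64 platforms).
#define DEMO_LONG_MATCHES_LONG_LONG 0
#endif

// The preprocessor's answer, available for conditional compilation.
inline bool LongMatchesLongLongPerPreprocessor() {
  return DEMO_LONG_MATCHES_LONG_LONG == 1;
}

// The compiler's answer, used here only as a cross-check.
inline bool LongMatchesLongLongPerCompiler() {
  return sizeof(long) == sizeof(long long);
}
```

On mainstream platforms the two functions should agree, which is the property the `#if (NPY_INT64 != NPY_LONGLONG)` form could not provide.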




[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380730#comment-16380730
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou opened a new pull request #1681: ARROW-2135: [Python] Fix NaN conversion 
when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681
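
As background on what the issue asks for, here is a minimal sketch of converting floating-point input to a nullable int64 column so that NaN becomes a null instead of being cast. It is not the code from this pull request. `NullableInt64Column` and `FloatsToNullableInt64` are hypothetical names, and a plain `std::vector<bool>` stands in for a validity bitmap.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical result type: values plus a per-element validity flag,
// loosely modelled on an integer array that can hold nulls.
struct NullableInt64Column {
  std::vector<int64_t> values;
  std::vector<bool> is_valid;
};

// Converts doubles to int64, mapping NaN to null instead of casting it.
// Casting NaN to an integer type is undefined behaviour in C++, so the
// NaN check has to come before the cast.
inline NullableInt64Column FloatsToNullableInt64(const std::vector<double>& input) {
  NullableInt64Column out;
  out.values.reserve(input.size());
  out.is_valid.reserve(input.size());
  for (double v : input) {
    if (std::isnan(v)) {
      out.values.push_back(0);  // placeholder value, masked by is_valid
      out.is_valid.push_back(false);
    } else {
      out.values.push_back(static_cast<int64_t>(v));
      out.is_valid.push_back(true);
    }
  }
  return out;
}
```

With the input `{1.0, 2.0, NaN}` from the report, this sketch yields the values 1 and 2 followed by a null slot, not the `-9223372036854775808` shown in the issue.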
 
 
   

