[jira] [Assigned] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou reassigned ARROW-2454:
-------------------------------------

    Assignee: Antoine Pitrou

> [Python] Empty chunked array slice crashes
> ------------------------------------------
>
>                 Key: ARROW-2454
>                 URL: https://issues.apache.org/jira/browse/ARROW-2454
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Major
>
> {code:python}
> >>> col = pa.Column.from_array('ints', pa.array([1,2,3]))
> >>> col
>
> chunk 0:
> [
>   1,
>   2,
>   3
> ]
> >>> col.data
>
> >>> col.data[:1]
>
> >>> col.data[:0]
> Segmentation fault (core dumped)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2422) Support more filter operators on Hive partitioned Parquet files
[ https://issues.apache.org/jira/browse/ARROW-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439186#comment-16439186 ] ASF GitHub Bot commented on ARROW-2422: --- jneuff commented on a change in pull request #1861: ARROW-2422 Support more operators for partition filtering URL: https://github.com/apache/arrow/pull/1861#discussion_r181675580 ## File path: python/pyarrow/tests/test_parquet.py ## @@ -997,40 +997,159 @@ def test_read_partitioned_directory(tmpdir): @parquet -def test_read_partitioned_directory_filtered(tmpdir): -fs = LocalFileSystem.get_instance() -base_path = str(tmpdir) - -import pyarrow.parquet as pq - -foo_keys = [0, 1] -bar_keys = ['a', 'b', 'c'] -partition_spec = [ -['foo', foo_keys], -['bar', bar_keys] -] -N = 30 - -df = pd.DataFrame({ -'index': np.arange(N), -'foo': np.array(foo_keys, dtype='i4').repeat(15), -'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2), -'values': np.random.randn(N) -}, columns=['index', 'foo', 'bar', 'values']) - -_generate_partition_directories(fs, base_path, partition_spec, df) - -dataset = pq.ParquetDataset( -base_path, filesystem=fs, -filters=[('foo', '=', 1), ('bar', '!=', 'b')] +class TestParquetFilter: + +def test_equivalency(tmpdir): +fs = LocalFileSystem.get_instance() +base_path = str(tmpdir) + +import pyarrow.parquet as pq + +integer_keys = [0, 1] +string_keys = ['a', 'b', 'c'] +boolean_keys = [True, False] +partition_spec = [ +['integer', integer_keys], +['string', string_keys], +['boolean', boolean_keys] +] +N = 30 + +df = pd.DataFrame({ +'index': np.arange(N), +'integer': np.array(integer_keys, dtype='i4').repeat(15), +'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2), +'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5), 3), +}, columns=['index', 'integer', 'string', 'boolean']) + +_generate_partition_directories(fs, base_path, partition_spec, df) + +dataset = pq.ParquetDataset( +base_path, filesystem=fs, 
+filters=[('integer', '=', 1), ('string', '!=', 'b'), ('boolean', '==', True)] +) +table = dataset.read() +result_df = (table.to_pandas() + .sort_values(by='index') + .reset_index(drop=True)) + +assert 0 not in result_df['integer'].values +assert 'b' not in result_df['string'].values +assert False not in result_df['boolean'].values + +def test_cutoff_exclusive_integer(tmpdir): +fs = LocalFileSystem.get_instance() +base_path = str(tmpdir) + +import pyarrow.parquet as pq + +integer_keys = [0, 1, 2, 3, 4] +partition_spec = [ +['integers', integer_keys], +] +N = 5 + +df = pd.DataFrame({ +'index': np.arange(N), +'integers': np.array(integer_keys, dtype='i4'), +}, columns=['index', 'integers']) + +_generate_partition_directories(fs, base_path, partition_spec, df) + +dataset = pq.ParquetDataset( +base_path, filesystem=fs, +filters=[ +('integers', '<', 4), +('integers', '>', 1), +] +) +table = dataset.read() +result_df = (table.to_pandas() + .sort_values(by='index') + .reset_index(drop=True)) + +result_list = [x for x in map(int, result_df['integers'].values)] +assert result_list == [2, 3] + +@pytest.mark.xfail( +raises=TypeError, reason='We suspect loss of type information in creation of categoricals.' ) Review comment: @xhochy This is the behavior we just told you about offline. `result_df['dates'].values` seems to be of type `object` instead of `datetime64`. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support more filter operators on Hive partitioned Parquet files > --- > > Key: ARROW-2422 > URL: https://issues.apache.org/jira/browse/ARROW-2422 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Julius Neuffer >Priority: Minor > Labels: features, pull-request-available > > After implementing basic filters ('=', '!=') on Hive partitioned Parquet > f
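The filter tuples exercised by these tests (e.g. `('integer', '=', 1)`) are AND-ed together and compared against each Hive partition's key values. As an illustration of that semantics only — not pyarrow's actual implementation, and with hypothetical names `keep_partition` and `_OPS` — the evaluation can be sketched as:

```python
import operator

# Map the comparison strings used in filter tuples to Python operators.
# '=' and '==' are treated as synonyms, mirroring the tests above.
# (Illustrative sketch, not pyarrow's code.)
_OPS = {
    '=': operator.eq,
    '==': operator.eq,
    '!=': operator.ne,
    '<': operator.lt,
    '>': operator.gt,
    '<=': operator.le,
    '>=': operator.ge,
}

def keep_partition(partition_keys, filters):
    """Return True if a partition with the given key values satisfies
    every (column, op, value) filter tuple (filters are AND-ed)."""
    return all(_OPS[op](partition_keys[col], val) for col, op, val in filters)

# A partition like .../integer=1/string=a/boolean=True/...
part = {'integer': 1, 'string': 'a', 'boolean': True}
print(keep_partition(part, [('integer', '=', 1),
                            ('string', '!=', 'b'),
                            ('boolean', '==', True)]))  # True
```

With this model, the `test_cutoff_exclusive_integer` case above keeps exactly the partitions whose key satisfies both `< 4` and `> 1`.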
[jira] [Commented] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439211#comment-16439211 ]

ASF GitHub Bot commented on ARROW-2454:
---------------------------------------

pitrou opened a new pull request #1897: ARROW-2454: [C++] Allow zero-array chunked arrays
URL: https://github.com/apache/arrow/pull/1897

    This allows code to be more regular and less fragile. Also fix the chunked array slicing logic.
[jira] [Updated] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2454:
----------------------------------

    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439224#comment-16439224 ] ASF GitHub Bot commented on ARROW-2101: --- pitrou closed pull request #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc b/cpp/src/arrow/python/numpy_to_arrow.cc index e37013c7e..dcb96a48a 100644 --- a/cpp/src/arrow/python/numpy_to_arrow.cc +++ b/cpp/src/arrow/python/numpy_to_arrow.cc @@ -228,12 +228,15 @@ static Status AppendObjectBinaries(PyArrayObject* arr, PyArrayObject* mask, /// can fit /// /// \param[in] offset starting offset for appending +/// \param[in] check_valid if set to true and the input array +/// contains values that cannot be converted to unicode, returns +/// a Status code containing a Python exception message /// \param[out] end_offset ending offset where we stopped appending. 
Will /// be length of arr if fully consumed /// \param[out] have_bytes true if we encountered any PyBytes object static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64_t offset, - StringBuilder* builder, int64_t* end_offset, - bool* have_bytes) { + bool check_valid, StringBuilder* builder, + int64_t* end_offset, bool* have_bytes) { PyObject* obj; Ndarray1DIndexer objects(arr); @@ -256,8 +259,7 @@ static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64 *have_bytes = true; } bool is_full; -RETURN_NOT_OK( -internal::BuilderAppend(builder, obj, false /* check_valid */, &is_full)); +RETURN_NOT_OK(internal::BuilderAppend(builder, obj, check_valid, &is_full)); if (is_full) { break; } @@ -844,6 +846,13 @@ Status NumPyConverter::ConvertObjectStrings() { StringBuilder builder(pool_); RETURN_NOT_OK(builder.Resize(length_)); + // If the creator of this NumPyConverter specified a type, + // then we want to force the output type to be utf8. If + // the input data is PyBytes and not PyUnicode and + // not convertible to utf8, the call to AppendObjectStrings + // below will fail because we pass force_string as the + // value for check_valid. 
+ bool force_string = type_ != nullptr && type_->Equals(utf8()); bool global_have_bytes = false; if (length_ == 0) { // Produce an empty chunk @@ -854,8 +863,10 @@ Status NumPyConverter::ConvertObjectStrings() { int64_t offset = 0; while (offset < length_) { bool chunk_have_bytes = false; - RETURN_NOT_OK( - AppendObjectStrings(arr_, mask_, offset, &builder, &offset, &chunk_have_bytes)); + // Always set check_valid to true when force_string is true + RETURN_NOT_OK(AppendObjectStrings(arr_, mask_, offset, +force_string /* check_valid */, &builder, &offset, +&chunk_have_bytes)); global_have_bytes = global_have_bytes | chunk_have_bytes; std::shared_ptr chunk; @@ -864,8 +875,13 @@ Status NumPyConverter::ConvertObjectStrings() { } } - // If we saw PyBytes, convert everything to BinaryArray - if (global_have_bytes) { + // If we saw bytes, convert it to a binary array. If + // force_string was set to true, the input data could + // have been bytes but we've checked to make sure that + // it can be converted to utf-8 in the call to + // AppendObjectStrings. In that case, we can safely leave + // it as a utf8 type. + if (!force_string && global_have_bytes) { for (size_t i = 0; i < out_arrays_.size(); ++i) { auto binary_data = out_arrays_[i]->data()->Copy(); binary_data->type = ::arrow::binary(); @@ -1393,8 +1409,12 @@ inline Status NumPyConverter::ConvertTypedLists( RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); int64_t offset = 0; - RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, value_builder, &offset, -&have_bytes)); + // If a type was specified and it was utf8, then we set + // check_valid to true. If any of the input cannot be + // converted, then we will exit early here. + bool check_valid = type_ != nullptr && type_->Equals(::arrow::utf8()); + RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, check_valid, +value_builder,
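The patch threads a `check_valid` flag through `AppendObjectStrings`: when the caller requested an Arrow `utf8()` type, any `bytes` input must decode as UTF-8 or the conversion errors out, and the converter records whether bytes were seen at all. A rough Python analogue of that control flow (a sketch with a hypothetical `append_object_strings` helper, not the C++ code):

```python
def append_object_strings(values, check_valid):
    """Sketch of the AppendObjectStrings logic from the patch: collect
    str/bytes values, note whether any bytes were seen, and - when
    check_valid is set - reject bytes that are not valid UTF-8."""
    out, have_bytes = [], False
    for obj in values:
        if isinstance(obj, bytes):
            have_bytes = True
            if check_valid:
                # Mirrors BuilderAppend(..., check_valid=true): invalid
                # UTF-8 input turns into an error.
                obj.decode('utf-8')  # raises UnicodeDecodeError if invalid
        out.append(obj)
    return out, have_bytes

# With check_valid=False (no type requested), raw bytes pass through and
# the result would later be re-tagged as binary rather than string.
_, saw_bytes = append_object_strings([b'a', 'b'], check_valid=False)
print(saw_bytes)  # True
```

When `force_string` is true, the bytes were already validated as UTF-8 during the append, so the output can safely stay typed as `utf8` instead of being downgraded to binary.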
[jira] [Resolved] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-2101.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.10.0

Issue resolved by pull request 1886
[https://github.com/apache/arrow/pull/1886]

> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2101
>                 URL: https://issues.apache.org/jira/browse/ARROW-2101
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
> Using Python 2, converting Pandas data of 'str' type to Arrow results in Arrow
> data of binary type, even if the user supplies type information. Conversion of
> 'unicode' type correctly creates Arrow data of string type. For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439227#comment-16439227 ]

ASF GitHub Bot commented on ARROW-2101:
---------------------------------------

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381550267

    Thank you @joshuastorck !
[jira] [Assigned] (ARROW-2463) [C++] Update flatbuffers to 1.9.0
[ https://issues.apache.org/jira/browse/ARROW-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn reassigned ARROW-2463: -- Assignee: Uwe L. Korn > [C++] Update flatbuffers to 1.9.0 > - > > Key: ARROW-2463 > URL: https://issues.apache.org/jira/browse/ARROW-2463 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.10.0 > > > This will update externalproject and manylinux1 installations of Flatbuffers. > The conda-forge update is at > https://github.com/conda-forge/flatbuffers-feedstock/pull/9 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2463) [C++] Update flatbuffers to 1.9.0
Uwe L. Korn created ARROW-2463: -- Summary: [C++] Update flatbuffers to 1.9.0 Key: ARROW-2463 URL: https://issues.apache.org/jira/browse/ARROW-2463 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Uwe L. Korn Fix For: 0.10.0 This will update externalproject and manylinux1 installations of Flatbuffers. The conda-forge update is at https://github.com/conda-forge/flatbuffers-feedstock/pull/9 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2464) [Python] Use a python_version marker instead of a condition
Omer Katz created ARROW-2464: Summary: [Python] Use a python_version marker instead of a condition Key: ARROW-2464 URL: https://issues.apache.org/jira/browse/ARROW-2464 Project: Apache Arrow Issue Type: Task Components: Packaging, Python Affects Versions: 0.9.0 Reporter: Omer Katz When installing pyarrow 0.9.0 pipenv complains that futures has no matching versions. While that may be a bug in pipenv it does not matter. The standard way to specify a conditional dependency is using a marker. We should use the python_version marker to tell pip if it should install futures or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
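The marker-based approach the ticket proposes attaches a PEP 508 environment marker to the requirement string itself, letting pip decide at install time whether `futures` is needed, instead of branching on `sys.version_info` in `setup.py`. A minimal sketch of what the declaration could look like (illustrative; the exact spelling is settled in the associated pull request):

```python
# Instead of conditionally appending 'futures' when running under
# Python 2, a PEP 508 environment marker makes the condition part of
# the requirement string, evaluated by pip against the target
# interpreter at install time.
install_requires = (
    'numpy >= 1.10',
    'six >= 1.0.0',
    # Only installed on interpreters that lack concurrent.futures:
    'futures; python_version < "3.2"',
)

print(any(r.startswith('futures') for r in install_requires))  # True
```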
[jira] [Commented] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439407#comment-16439407 ] ASF GitHub Bot commented on ARROW-2464: --- thedrow commented on issue #1879: ARROW-2464: [Python] Use a python_version marker instead of a condition URL: https://github.com/apache/arrow/pull/1879#issuecomment-381591253 I opened a ticket. I can't change the branch name without opening a new PR. Is this sufficient? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Use a python_version marker instead of a condition > --- > > Key: ARROW-2464 > URL: https://issues.apache.org/jira/browse/ARROW-2464 > Project: Apache Arrow > Issue Type: Task > Components: Packaging, Python >Affects Versions: 0.9.0 >Reporter: Omer Katz >Priority: Minor > Labels: pull-request-available > > When installing pyarrow 0.9.0 pipenv complains that futures has no matching > versions. > While that may be a bug in pipenv it does not matter. The standard way to > specify a conditional dependency is using a marker. > We should use the python_version marker to tell pip if it should install > futures or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2464:
----------------------------------

    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-2463) [C++] Update flatbuffers to 1.9.0
[ https://issues.apache.org/jira/browse/ARROW-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439461#comment-16439461 ] ASF GitHub Bot commented on ARROW-2463: --- xhochy opened a new pull request #1898: ARROW-2463: [C++] Update flatbuffers to 1.9.0 URL: https://github.com/apache/arrow/pull/1898 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Update flatbuffers to 1.9.0 > - > > Key: ARROW-2463 > URL: https://issues.apache.org/jira/browse/ARROW-2463 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > This will update externalproject and manylinux1 installations of Flatbuffers. > The conda-forge update is at > https://github.com/conda-forge/flatbuffers-feedstock/pull/9 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2463) [C++] Update flatbuffers to 1.9.0
[ https://issues.apache.org/jira/browse/ARROW-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2463:
----------------------------------

    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439590#comment-16439590 ] ASF GitHub Bot commented on ARROW-2464: --- pitrou commented on issue #1879: ARROW-2464: [Python] Use a python_version marker instead of a condition URL: https://github.com/apache/arrow/pull/1879#issuecomment-381646788 Yes, it should be ok. Also thanks for explaining the bug on the JIRA issue. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Use a python_version marker instead of a condition > --- > > Key: ARROW-2464 > URL: https://issues.apache.org/jira/browse/ARROW-2464 > Project: Apache Arrow > Issue Type: Task > Components: Packaging, Python >Affects Versions: 0.9.0 >Reporter: Omer Katz >Priority: Minor > Labels: pull-request-available > > When installing pyarrow 0.9.0 pipenv complains that futures has no matching > versions. > While that may be a bug in pipenv it does not matter. The standard way to > specify a conditional dependency is using a marker. > We should use the python_version marker to tell pip if it should install > futures or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-2464.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.10.0

Issue resolved by pull request 1879
[https://github.com/apache/arrow/pull/1879]
[jira] [Commented] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439591#comment-16439591 ] ASF GitHub Bot commented on ARROW-2464: --- pitrou closed pull request #1879: ARROW-2464: [Python] Use a python_version marker instead of a condition URL: https://github.com/apache/arrow/pull/1879 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/setup.py b/python/setup.py index 20b2416da4..8d26e092bc 100644 --- a/python/setup.py +++ b/python/setup.py @@ -447,10 +447,11 @@ def has_ext_modules(foo): return True -install_requires = ['numpy >= 1.10', 'six >= 1.0.0'] - -if sys.version_info.major == 2: -install_requires.append('futures') +install_requires = ( +'numpy >= 1.10', +'six >= 1.0.0', +'futures;python_version<"3.2"' +) def parse_version(root): This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Use a python_version marker instead of a condition > --- > > Key: ARROW-2464 > URL: https://issues.apache.org/jira/browse/ARROW-2464 > Project: Apache Arrow > Issue Type: Task > Components: Packaging, Python >Affects Versions: 0.9.0 >Reporter: Omer Katz >Priority: Minor > Labels: pull-request-available > Fix For: 0.10.0 > > > When installing pyarrow 0.9.0 pipenv complains that futures has no matching > versions. > While that may be a bug in pipenv it does not matter. The standard way to > specify a conditional dependency is using a marker. > We should use the python_version marker to tell pip if it should install > futures or not. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-2454.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.10.0

Issue resolved by pull request 1897
[https://github.com/apache/arrow/pull/1897]
[jira] [Commented] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439595#comment-16439595 ] ASF GitHub Bot commented on ARROW-2454: --- pitrou closed pull request #1897: ARROW-2454: [C++] Allow zero-array chunked arrays URL: https://github.com/apache/arrow/pull/1897 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index b1cf6e59a2..0b9f75df19 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -43,7 +43,9 @@ class TestChunkedArray : public TestBase { protected: virtual void Construct() { one_ = std::make_shared(arrays_one_); -another_ = std::make_shared(arrays_another_); +if (!arrays_another_.empty()) { + another_ = std::make_shared(arrays_another_); +} } ArrayVector arrays_one_; @@ -121,6 +123,23 @@ TEST_F(TestChunkedArray, SliceEquals) { std::shared_ptr slice2 = one_->Slice(75)->Slice(25)->Slice(25, 50); ASSERT_EQ(slice2->length(), 50); test::AssertChunkedEqual(*slice, *slice2); + + // Making empty slices of a ChunkedArray + std::shared_ptr slice3 = one_->Slice(one_->length(), 99); + ASSERT_EQ(slice3->length(), 0); + ASSERT_EQ(slice3->num_chunks(), 0); + ASSERT_TRUE(slice3->type()->Equals(one_->type())); + + std::shared_ptr slice4 = one_->Slice(10, 0); + ASSERT_EQ(slice4->length(), 0); + ASSERT_EQ(slice4->num_chunks(), 0); + ASSERT_TRUE(slice4->type()->Equals(one_->type())); + + // Slicing an empty ChunkedArray + std::shared_ptr slice5 = slice4->Slice(0, 10); + ASSERT_EQ(slice5->length(), 0); + ASSERT_EQ(slice5->num_chunks(), 0); + ASSERT_TRUE(slice5->type()->Equals(one_->type())); } class TestColumn : public TestChunkedArray { diff --git a/cpp/src/arrow/table.cc 
b/cpp/src/arrow/table.cc index f6ac6dd3b1..8af47ea220 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -39,13 +39,25 @@ namespace arrow { ChunkedArray::ChunkedArray(const ArrayVector& chunks) : chunks_(chunks) { length_ = 0; null_count_ = 0; + DCHECK_GT(chunks.size(), 0) + << "cannot construct ChunkedArray from empty vector and omitted type"; + type_ = chunks[0]->type(); for (const std::shared_ptr& chunk : chunks) { length_ += chunk->length(); null_count_ += chunk->null_count(); } } -std::shared_ptr ChunkedArray::type() const { return chunks_[0]->type(); } +ChunkedArray::ChunkedArray(const ArrayVector& chunks, + const std::shared_ptr& type) +: chunks_(chunks), type_(type) { + length_ = 0; + null_count_ = 0; + for (const std::shared_ptr& chunk : chunks) { +length_ += chunk->length(); +null_count_ += chunk->null_count(); + } +} bool ChunkedArray::Equals(const ChunkedArray& other) const { if (length_ != other.length()) { @@ -107,20 +119,20 @@ std::shared_ptr ChunkedArray::Slice(int64_t offset, int64_t length DCHECK_LE(offset, length_); int curr_chunk = 0; - while (offset >= chunk(curr_chunk)->length()) { + while (curr_chunk < num_chunks() && offset >= chunk(curr_chunk)->length()) { offset -= chunk(curr_chunk)->length(); curr_chunk++; } ArrayVector new_chunks; - while (length > 0 && curr_chunk < num_chunks()) { + while (curr_chunk < num_chunks() && length > 0) { new_chunks.push_back(chunk(curr_chunk)->Slice(offset, length)); length -= chunk(curr_chunk)->length() - offset; offset = 0; curr_chunk++; } - return std::make_shared(new_chunks); + return std::make_shared(new_chunks, type_); } std::shared_ptr ChunkedArray::Slice(int64_t offset) const { @@ -129,15 +141,15 @@ std::shared_ptr ChunkedArray::Slice(int64_t offset) const { Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) : field_(field) { - data_ = std::make_shared(chunks); + data_ = std::make_shared(chunks, field->type()); } Column::Column(const std::shared_ptr& field, const 
std::shared_ptr& data) : field_(field) { if (!data) { -data_ = std::make_shared(ArrayVector({})); +data_ = std::make_shared(ArrayVector({}), field->type()); } else { -data_ = std::make_shared(ArrayVector({data})); +data_ = std::make_shared(ArrayVector({data}), field->type()); } } diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 20d027d6a5..32af224ff4 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -40,6 +40,7 @@ class Status; class ARROW_EXPORT ChunkedArray { public: explicit ChunkedArray(const ArrayVector& chunks); + ChunkedArray(const ArrayVector& chunks, const std::shared_ptr& type); /// \return the total length of the chunked array; computed on co
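The guards added to both loops (`curr_chunk < num_chunks()`) are what make empty and past-the-end slices safe: before the patch, an offset at or beyond the total length walked off the end of the chunk vector. The patched slicing logic can be sketched in pure Python over plain lists standing in for chunks (a hypothetical `slice_chunks` helper, not Arrow's API):

```python
def slice_chunks(chunks, offset, length):
    """Sketch of the patched ChunkedArray::Slice: chunks is a list of
    lists; return the new chunk list covering [offset, offset+length).
    The `curr < len(chunks)` guards are the essence of the fix."""
    curr = 0
    # Skip whole chunks that lie entirely before the slice start.
    while curr < len(chunks) and offset >= len(chunks[curr]):
        offset -= len(chunks[curr])
        curr += 1
    out = []
    # Take pieces of chunks until the requested length is exhausted.
    while curr < len(chunks) and length > 0:
        out.append(chunks[curr][offset:offset + length])
        length -= len(chunks[curr]) - offset
        offset = 0
        curr += 1
    return out

print(slice_chunks([[1, 2, 3]], 3, 99))  # [] - empty slice, no crash
```

This matches the new `SliceEquals` test cases above: slicing at the end, slicing with length 0, and slicing an already-empty result all yield a zero-chunk array of the original type.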
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439603#comment-16439603 ] Antoine Pitrou commented on ARROW-2372: --- This may have been fixed with ARROW-2369. Is there a possibility for you to test with Arrow git master? > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 > > > I get an ArrowIOError when reading a specific file that was also written by > pyarrow. Specifically, the traceback is: > {code:python} > >>> import pyarrow.parquet as pq > >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > --- > ArrowIOError Traceback (most recent call last) > in () > > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in > _init_(self, source, metadata, common_metadata) > 62 self.reader = ParquetReader() > 63 source = _ensure_file(source) > ---> 64 self.reader.open(source, metadata=metadata) > 65 self.common_metadata = common_metadata > 66 self._nested_paths_by_prefix = self._build_nested_paths() > _parquet.pyx in pyarrow._parquet.ParquetReader.open() > error.pxi in pyarrow.lib.check_status() > ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument > {code} > Here's a reproducible example with the specific file I'm working with. I'm > converting a 34 GB csv file to parquet in chunks of roughly 2GB each. 
To get > the source data: > {code:bash} > wget > https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip > unzip gaz2016zcta5distancemiles.csv.zip{code} > Then the basic idea from the [pyarrow Parquet > documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] > is instantiating the writer class; looping over chunks of the csv and > writing them to parquet; then closing the writer object. > > {code:python} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > from pathlib import Path > zcta_file = Path('gaz2016zcta5distancemiles.csv') > itr = pd.read_csv( > zcta_file, > header=0, > dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64}, > engine='c', > chunksize=64617153) > schema = pa.schema([ > pa.field('zip1', pa.string()), > pa.field('zip2', pa.string()), > pa.field('mi_to_zcta5', pa.float64())]) > writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema) > print(f'Starting conversion') > i = 0 > for df in itr: > i += 1 > print(f'Finished reading csv block {i}') > table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3) > writer.write_table(table) > print(f'Finished writing parquet block {i}') > writer.close() > {code} > Then running this python script produces the file > {code:java} > gaz2016zcta5distancemiles.parquet{code} > , but just attempting to read the metadata with `pq.ParquetFile()` produces > the above exception. > I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would > complain on import of the csv if the columns in the data were not `string`, > `string`, and `float64`, so I think creating the Parquet schema in that way > should be fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
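As an aside, the conversion loop in the report can be hardened so that `writer.close()` runs even if a chunk fails mid-conversion; an unclosed Parquet writer leaves the file footer unwritten, which by itself produces unreadable files. A sketch of the pattern, with a stand-in `Writer` class (hypothetical) in place of `pq.ParquetWriter`:

```python
# Chunked-write loop with a guaranteed close via contextlib.closing.
# Writer stands in for pq.ParquetWriter (illustrative only); the loop
# shape matches the script in the report.
from contextlib import closing

class Writer:
    def __init__(self):
        self.blocks = []
        self.closed = False

    def write_table(self, block):
        self.blocks.append(block)

    def close(self):
        self.closed = True

def write_in_chunks(chunks):
    writer = Writer()
    with closing(writer):  # close() runs even if write_table() raises
        for block in chunks:
            writer.write_table(block)
    return writer
```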
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439646#comment-16439646 ] Kyle Barron commented on ARROW-2372: Sorry, I couldn't figure out how to build Arrow and Parquet. I tried to follow [https://github.com/apache/arrow/blob/master/python/doc/source/development.rst] with Conda exactly, but I get errors. Specifically, I think it's trying to use gcc 7.2.0 instead of 4.9. I might just have to wait for 0.9.1. > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439735#comment-16439735 ] ASF GitHub Bot commented on ARROW-2101: --- BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381681329 Thanks for the clarification of Python 2 behaviour @xhochy , and thanks for the fix @joshuastorck ! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow > data of binary type, even if the user supplies type information. conversion > of 'unicode' type works to create Arrow data of string types. For example > {code} > In [25]: pa.Array.from_pandas(pd.Series(['a'])).type > Out[25]: DataType(binary) > In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type > Out[26]: DataType(binary) > In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type > Out[27]: DataType(string) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
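The intended behaviour — bytes-like input infers Arrow binary, but an explicitly requested string type should win rather than being silently ignored — can be sketched as a small conversion rule. This is an illustrative model in Python 3, where `bytes` plays the role of Python 2 `str`; it is not pyarrow's actual conversion path:

```python
# Model of the conversion rule ARROW-2101 asks for: infer binary from
# bytes, but honor an explicit requested type of "string" by decoding.
# Illustrative only; type names are simplified stand-ins.

def infer_arrow_type(value):
    return "binary" if isinstance(value, bytes) else "string"

def convert_values(values, requested_type=None):
    inferred = infer_arrow_type(values[0])
    out_type = requested_type or inferred
    if out_type == "string":
        # An explicit string request decodes bytes instead of ignoring
        # the requested type (the bug being fixed here).
        values = [v.decode("utf8") if isinstance(v, bytes) else v
                  for v in values]
    return out_type, values
```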
[jira] [Assigned] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-2101: --- Assignee: (was: Bryan Cutler) > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439743#comment-16439743 ] ASF GitHub Bot commented on ARROW-2101: --- BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381682065 @joshuastorck , what is your JIRA username so I can assign the issue to you? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation
[ https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439807#comment-16439807 ] ASF GitHub Bot commented on ARROW-2430: --- kszucs commented on issue #1869: ARROW-2430: [Packaging] MVP for branch based packaging automation URL: https://github.com/apache/arrow/pull/1869#issuecomment-381697835 @wesm [Updated.](https://github.com/kszucs/arrow/blob/6a2b126bcf99b051c5a852afaece01c60586f815/cd/crossbow.py) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > MVP for branch based packaging automation > - > > Key: ARROW-2430 > URL: https://issues.apache.org/jira/browse/ARROW-2430 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > > Described in > https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2465) [Plasma] plasma_store fails to find libarrow_gpu.so
Antoine Pitrou created ARROW-2465: - Summary: [Plasma] plasma_store fails to find libarrow_gpu.so Key: ARROW-2465 URL: https://issues.apache.org/jira/browse/ARROW-2465 Project: Apache Arrow Issue Type: Bug Components: GPU, Plasma (C++) Affects Versions: 0.9.0 Reporter: Antoine Pitrou After install, I get the following: {code:bash} $ which plasma_store /home/antoine/miniconda3/envs/pyarrow/bin/plasma_store $ plasma_store plasma_store: error while loading shared libraries: libarrow_gpu.so.0: cannot open shared object file: No such file or directory $ ldd `which plasma_store` linux-vdso.so.1 => (0x7ffe7bdf) libarrow_gpu.so.0 => not found libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f5d81676000) libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7f5d812ee000) libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f5d80fe5000) libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7f5d80dce000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f5d80a04000) /lib64/ld-linux-x86-64.so.2 (0x7f5d81893000) {code} Note that {{libarrow_gpu.so}} is installed in {{/home/antoine/miniconda3/envs/pyarrow/lib/}} There are probably two solutions: * link statically with the Arrow GPU libs (I wonder why this isn't done like it is for the Arrow libs) * or make the rpath correct -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2465) [Plasma] plasma_store fails to find libarrow_gpu.so
[ https://issues.apache.org/jira/browse/ARROW-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439808#comment-16439808 ] Antoine Pitrou commented on ARROW-2465: --- [~wapaul] > [Plasma] plasma_store fails to find libarrow_gpu.so > --- > > Key: ARROW-2465 > URL: https://issues.apache.org/jira/browse/ARROW-2465 > Project: Apache Arrow > Issue Type: Bug > Components: GPU, Plasma (C++) >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2442) [C++] Disambiguate Builder::Append overloads
[ https://issues.apache.org/jira/browse/ARROW-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439823#comment-16439823 ] ASF GitHub Bot commented on ARROW-2442: --- pitrou opened a new pull request #1900: ARROW-2442: [C++] Disambiguate builder Append() overloads URL: https://github.com/apache/arrow/pull/1900 Vector-style Append() methods are renamed AppendValues(). The original methods are marked deprecated. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Disambiguate Builder::Append overloads > > > Key: ARROW-2442 > URL: https://issues.apache.org/jira/browse/ARROW-2442 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner, pull-request-available > > See discussion in > [https://github.com/apache/arrow/pull/1852#discussion_r179919627] > There are various {{Append()}} overloads in Builder and subclasses, some of > which append one value, some of which append multiple values at once. > The API might be clearer and less error-prone if multiple-append variants > were named differently, for example {{AppendValues()}}. Especially with the > pointer-taking variants, it's probably easy to call the wrong overload by > mistake. > The existing methods would have to go through a deprecation cycle. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
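The rename in this PR — single-value Append() versus vector-style AppendValues(), with the old spellings kept through a deprecation cycle — can be sketched in Python with a toy builder (illustrative only; the real change is to the Arrow C++ builder classes):

```python
# Toy builder showing the single-value / multi-value split and a
# deprecated alias for the old vector-style spelling. Not the Arrow
# C++ API; names are Pythonized for illustration.
import warnings

class Int64Builder:
    def __init__(self):
        self.values = []

    def append(self, value):
        # Unambiguous: always appends exactly one value.
        self.values.append(value)

    def append_values(self, values):
        # Unambiguous: always appends a sequence of values.
        self.values.extend(values)

    def append_many(self, values):
        # Old vector-style spelling, kept for one deprecation cycle.
        warnings.warn("use append_values() instead", DeprecationWarning)
        self.append_values(values)
```

The point of the split is that a caller can no longer pick a multi-value overload by accident when passing a pointer-like argument, which is the error-prone case the JIRA describes.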
[jira] [Updated] (ARROW-2442) [C++] Disambiguate Builder::Append overloads
[ https://issues.apache.org/jira/browse/ARROW-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2442: -- Labels: beginner pull-request-available (was: beginner) > [C++] Disambiguate Builder::Append overloads > > > Key: ARROW-2442 > URL: https://issues.apache.org/jira/browse/ARROW-2442 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner, pull-request-available -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2465) [Plasma] plasma_store fails to find libarrow_gpu.so
[ https://issues.apache.org/jira/browse/ARROW-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439825#comment-16439825 ] ASF GitHub Bot commented on ARROW-2465: --- pitrou opened a new pull request #1901: ARROW-2465: [Plasma/GPU] Preserve plasma_store rpath URL: https://github.com/apache/arrow/pull/1901 This allows it to find libarrow_gpu.so when installed This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Plasma] plasma_store fails to find libarrow_gpu.so > --- > > Key: ARROW-2465 > URL: https://issues.apache.org/jira/browse/ARROW-2465 > Project: Apache Arrow > Issue Type: Bug > Components: GPU, Plasma (C++) >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2442) [C++] Disambiguate Builder::Append overloads
[ https://issues.apache.org/jira/browse/ARROW-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439826#comment-16439826 ] ASF GitHub Bot commented on ARROW-2442: --- pitrou commented on issue #1900: ARROW-2442: [C++] Disambiguate builder Append() overloads URL: https://github.com/apache/arrow/pull/1900#issuecomment-381702739 Is it worth adding deprecation pragmas so that users of those functions get a compiler warning? See https://stackoverflow.com/questions/295120/c-mark-as-deprecated This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Disambiguate Builder::Append overloads > > > Key: ARROW-2442 > URL: https://issues.apache.org/jira/browse/ARROW-2442 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner, pull-request-available -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2465) [Plasma] plasma_store fails to find libarrow_gpu.so
[ https://issues.apache.org/jira/browse/ARROW-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2465: -- Labels: pull-request-available (was: ) > [Plasma] plasma_store fails to find libarrow_gpu.so > --- > > Key: ARROW-2465 > URL: https://issues.apache.org/jira/browse/ARROW-2465 > Project: Apache Arrow > Issue Type: Bug > Components: GPU, Plasma (C++) >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1993) [Python] Add function for determining implied Arrow schema from pandas.DataFrame
[ https://issues.apache.org/jira/browse/ARROW-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1993: --- Labels: beginner (was: ) > [Python] Add function for determining implied Arrow schema from > pandas.DataFrame > > > Key: ARROW-1993 > URL: https://issues.apache.org/jira/browse/ARROW-1993 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > Currently the only option is to use {{Table/Array.from_pandas}} which does > significant unnecessary work and allocates memory. If only the schema is of > interest, then we could do less work and not allocate memory -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1993) [Python] Add function for determining implied Arrow schema from pandas.DataFrame
[ https://issues.apache.org/jira/browse/ARROW-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1993: --- Description: Currently the only option is to use {{Table/Array.from_pandas}} which does significant unnecessary work and allocates memory. If only the schema is of interest, then we could do less work and not allocate memory. We should provide the user a function {{pyarrow.Schema.from_pandas}} which takes a DataFrame as an input and returns the respective Arrow schema. was: Currently the only option is to use {{Table/Array.from_pandas}} which does significant unnecessary work and allocates memory. If only the schema is of interest, then we could do less work and not allocate memory > [Python] Add function for determining implied Arrow schema from > pandas.DataFrame > > > Key: ARROW-1993 > URL: https://issues.apache.org/jira/browse/ARROW-1993 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
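The key property of the proposed function is that it only needs to look at column dtypes, never at the values, so no Arrow memory is allocated. A sketch of that idea over a plain dtype-name mapping (the mapping below is a simplified assumption, not pyarrow's full inference rules):

```python
# Schema inference from dtype names alone, without touching any data.
# The dtype-to-Arrow mapping is deliberately tiny and illustrative.
DTYPE_TO_ARROW = {
    "int64": "int64",
    "float64": "double",
    "bool": "bool",
    "object": "string",  # simplification: assume object columns hold text
}

def schema_from_dtypes(dtypes):
    """dtypes: mapping of column name -> numpy-style dtype name.
    Returns (name, arrow_type) pairs; raises on unsupported dtypes."""
    unknown = [d for d in dtypes.values() if d not in DTYPE_TO_ARROW]
    if unknown:
        raise TypeError("unsupported dtypes: %s" % unknown)
    return [(name, DTYPE_TO_ARROW[d]) for name, d in dtypes.items()]
```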
[jira] [Updated] (ARROW-1894) [Python] Treat CPython memoryview or buffer objects equivalently to pyarrow.Buffer in pyarrow.serialize
[ https://issues.apache.org/jira/browse/ARROW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1894: --- Labels: beginner (was: ) > [Python] Treat CPython memoryview or buffer objects equivalently to > pyarrow.Buffer in pyarrow.serialize > --- > > Key: ARROW-1894 > URL: https://issues.apache.org/jira/browse/ARROW-1894 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > These should be treated as Buffer-like on serialize. We should consider how > to "box" the buffers as the appropriate kind of object (Buffer, memoryview, > etc.) when being deserialized -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1983: --- Description: Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file (mostly just schema information). It would be useful to add the ability to write a {{_metadata}} file as well. This should include information about each row group in the dataset, including summary statistics. Having this summary file would allow filtering of row groups without needing to access each file beforehand. This would require that the user is able to get the written RowGroups out of a {{pyarrow.parquet.write_table}} call and then give these objects as a list to a new function that passes them on as C++ objects to {{parquet-cpp}}, which generates the respective {{_metadata}} file. was: Currently `pyarrow.parquet` can only write the `_common_metadata` file (mostly just schema information). It would be useful to add the ability to write a `_metadata` file as well. This should include information about each row group in the dataset, including summary statistics. Having this summary file would allow filtering of row groups without needing to access each file beforehand. > [Python] Add ability to write parquet `_metadata` file > -- > > Key: ARROW-1983 > URL: https://issues.apache.org/jira/browse/ARROW-1983 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Jim Crist >Priority: Major > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
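The kind of per-row-group summary a _metadata file would carry, and how a reader could use it to skip row groups, can be sketched as follows (illustrative only; the real _metadata file holds Thrift-encoded Parquet metadata, not Python dicts):

```python
def summarize_row_groups(row_groups):
    """row_groups: list of dicts mapping column name -> list of values.
    Returns per-row-group min/max statistics, the kind of summary a
    _metadata file would hold so readers can prune row groups without
    opening each data file."""
    summary = []
    for rg in row_groups:
        stats = {col: {"min": min(vals),
                       "max": max(vals),
                       "num_values": len(vals)}
                 for col, vals in rg.items()}
        summary.append(stats)
    return summary

def prune(summary, column, predicate_min):
    # Keep only row groups whose max could satisfy value >= predicate_min;
    # everything else is skipped without any file I/O.
    return [i for i, stats in enumerate(summary)
            if stats[column]["max"] >= predicate_min]
```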
[jira] [Updated] (ARROW-1731) [Python] Provide for selecting a subset of columns to convert in RecordBatch/Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1731: --- Labels: beginner (was: ) > [Python] Provide for selecting a subset of columns to convert in > RecordBatch/Table.from_pandas > -- > > Key: ARROW-1731 > URL: https://issues.apache.org/jira/browse/ARROW-1731 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > Currently it's all-or-nothing, and to do the subsetting in pandas incurs a > data copy. This would enable columns (by name or index) to be selected out > without additional data copying > cc [~cpcloud] [~jreback] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2082: -- Labels: pull-request-available (was: ) > [Python] SegFault in pyarrow.parquet.write_table with specific options > -- > > Key: ARROW-2082 > URL: https://issues.apache.org/jira/browse/ARROW-2082 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu > Xenial (Python 3.5) >Reporter: Clément Bouscasse >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > I originally filed an issue in the pandas project but we've tracked it down > to arrow itself, when called via pandas in specific circumstances: > [https://github.com/pandas-dev/pandas/issues/19493] > basically using > {code:java} > df.to_parquet('filename.parquet', flavor='spark'){code} > gives a seg fault if `df` contains a datetime column. > Under the covers, pandas translates this to the following call: > {code:java} > pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy', > coerce_timestamps='ms') > {code} > which gives me an instant crash. > There is a repro on the github ticket. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439852#comment-16439852 ] ASF GitHub Bot commented on ARROW-2082: --- joshuastorck opened a new pull request #456: ARROW-2082: Prevent segfault that was occurring when writing a nanosecond timestamp with arrow writer properties set to coerce timestamps and support deprecated int96 timestamps. URL: https://github.com/apache/parquet-cpp/pull/456 The bug was due to the fact that the physical type was int64 but the WriteTimestamps function was taking a path that assumed the physical type was int96. This caused memory corruption because it was writing past the end of the array. The bug was fixed by checking that coerce timestamps is disabled when writing int96. A unit test was added for the regression. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] SegFault in pyarrow.parquet.write_table with specific options > -- > > Key: ARROW-2082 > URL: https://issues.apache.org/jira/browse/ARROW-2082 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu > Xenial (Python 3.5) >Reporter: Clément Bouscasse >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-799) [Java] Provide guidance in documentation for using Arrow in an uberjar setting
[ https://issues.apache.org/jira/browse/ARROW-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-799: -- Labels: beginner (was: ) > [Java] Provide guidance in documentation for using Arrow in an uberjar > setting > --- > > Key: ARROW-799 > URL: https://issues.apache.org/jira/browse/ARROW-799 > Project: Apache Arrow > Issue Type: Task >Reporter: Jingyuan Wang >Assignee: Li Jin >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > Currently, ArrowBuf class directly access the package-private fields of > AbstractByteBuf class which makes shading Apache Arrow problematic. If we > relocate io.netty namespace excluding io.netty.buffer.ArrowBuf, it would > throw out IllegalAccessException. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-889) [Python] Add nicer __repr__ for Column
[ https://issues.apache.org/jira/browse/ARROW-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-889: -- Labels: beginner (was: ) > [Python] Add nicer __repr__ for Column > -- > > Key: ARROW-889 > URL: https://issues.apache.org/jira/browse/ARROW-889 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1715) [Python] Implement pickling for Array, Column, ChunkedArray, RecordBatch, Table
[ https://issues.apache.org/jira/browse/ARROW-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1715: --- Labels: beginner (was: ) > [Python] Implement pickling for Array, Column, ChunkedArray, RecordBatch, > Table > --- > > Key: ARROW-1715 > URL: https://issues.apache.org/jira/browse/ARROW-1715 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-799) [Java] Provide guidance in documentation for using Arrow in an uberjar setting
[ https://issues.apache.org/jira/browse/ARROW-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-799: -- Issue Type: Improvement (was: Task) > [Java] Provide guidance in documentation for using Arrow in an uberjar > setting > --- > > Key: ARROW-799 > URL: https://issues.apache.org/jira/browse/ARROW-799 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jingyuan Wang >Assignee: Li Jin >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > Currently, ArrowBuf class directly access the package-private fields of > AbstractByteBuf class which makes shading Apache Arrow problematic. If we > relocate io.netty namespace excluding io.netty.buffer.ArrowBuf, it would > throw out IllegalAccessException. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1388) [Python] Add Table.drop method for removing columns
[ https://issues.apache.org/jira/browse/ARROW-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1388: --- Description: See ARROW-1374 for a use case. This function should take as an input a list of columns and return a new Table instance without them. (was: See ARROW-1374 for a use case) > [Python] Add Table.drop method for removing columns > --- > > Key: ARROW-1388 > URL: https://issues.apache.org/jira/browse/ARROW-1388 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > See ARROW-1374 for a use case. This function should take as an input a list > of columns and return a new Table instance without them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439856#comment-16439856 ] ASF GitHub Bot commented on ARROW-2101: --- joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381708268 @BryanCutler, my JIRA username is joshuastorck This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow > data of binary type, even if the user supplies type information. conversion > of 'unicode' type works to create Arrow data of string types. For example > {code} > In [25]: pa.Array.from_pandas(pd.Series(['a'])).type > Out[25]: DataType(binary) > In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type > Out[26]: DataType(binary) > In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type > Out[27]: DataType(string) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2057) [Python] Configure size of data pages in pyarrow.parquet.write_table
[ https://issues.apache.org/jira/browse/ARROW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-2057: --- Description: It would be useful to be able to set the size of data pages (within Parquet column chunks) from Python. The current default is set to 1MiB at https://github.com/apache/parquet-cpp/blob/0875e43010af485e1c0b506d77d7e0edc80c66cc/src/parquet/properties.h#L81. It might be useful in some situations to lower this for more granular access. We should provide this value as a parameter to {{pyarrow.parquet.write_table}}. was:It would be useful to be able to set the size of data pages (within Parquet column chunks) from Python > [Python] Configure size of data pages in pyarrow.parquet.write_table > > > Key: ARROW-2057 > URL: https://issues.apache.org/jira/browse/ARROW-2057 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > It would be useful to be able to set the size of data pages (within Parquet > column chunks) from Python. The current default is set to 1MiB at > https://github.com/apache/parquet-cpp/blob/0875e43010af485e1c0b506d77d7e0edc80c66cc/src/parquet/properties.h#L81. > It might be useful in some situations to lower this for more granular access. > We should provide this value as a parameter to > {{pyarrow.parquet.write_table}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439884#comment-16439884 ] Antoine Pitrou commented on ARROW-2372: --- Ok, I have downloaded the dataset and confirm that it works on git master. > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 > > > I get an ArrowIOError when reading a specific file that was also written by > pyarrow. Specifically, the traceback is: > {code:python} > >>> import pyarrow.parquet as pq > >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > --- > ArrowIOError Traceback (most recent call last) > in () > > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in > _init_(self, source, metadata, common_metadata) > 62 self.reader = ParquetReader() > 63 source = _ensure_file(source) > ---> 64 self.reader.open(source, metadata=metadata) > 65 self.common_metadata = common_metadata > 66 self._nested_paths_by_prefix = self._build_nested_paths() > _parquet.pyx in pyarrow._parquet.ParquetReader.open() > error.pxi in pyarrow.lib.check_status() > ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument > {code} > Here's a reproducible example with the specific file I'm working with. I'm > converting a 34 GB csv file to parquet in chunks of roughly 2GB each. 
To get > the source data: > {code:bash} > wget > https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip > unzip gaz2016zcta5distancemiles.csv.zip{code} > Then the basic idea from the [pyarrow Parquet > documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] > is instantiating the writer class; looping over chunks of the csv and > writing them to parquet; then closing the writer object. > > {code:python} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > from pathlib import Path > zcta_file = Path('gaz2016zcta5distancemiles.csv') > itr = pd.read_csv( > zcta_file, > header=0, > dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64}, > engine='c', > chunksize=64617153) > schema = pa.schema([ > pa.field('zip1', pa.string()), > pa.field('zip2', pa.string()), > pa.field('mi_to_zcta5', pa.float64())]) > writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema) > print(f'Starting conversion') > i = 0 > for df in itr: > i += 1 > print(f'Finished reading csv block {i}') > table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3) > writer.write_table(table) > print(f'Finished writing parquet block {i}') > writer.close() > {code} > Then running this python script produces the file > {code:java} > gaz2016zcta5distancemiles.parquet{code} > , but just attempting to read the metadata with `pq.ParquetFile()` produces > the above exception. > I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would > complain on import of the csv if the columns in the data were not `string`, > `string`, and `float64`, so I think creating the Parquet schema in that way > should be fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439885#comment-16439885 ] Kyle Barron commented on ARROW-2372: Awesome thanks! > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 > > > I get an ArrowIOError when reading a specific file that was also written by > pyarrow. Specifically, the traceback is: > {code:python} > >>> import pyarrow.parquet as pq > >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > --- > ArrowIOError Traceback (most recent call last) > in () > > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in > _init_(self, source, metadata, common_metadata) > 62 self.reader = ParquetReader() > 63 source = _ensure_file(source) > ---> 64 self.reader.open(source, metadata=metadata) > 65 self.common_metadata = common_metadata > 66 self._nested_paths_by_prefix = self._build_nested_paths() > _parquet.pyx in pyarrow._parquet.ParquetReader.open() > error.pxi in pyarrow.lib.check_status() > ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument > {code} > Here's a reproducible example with the specific file I'm working with. I'm > converting a 34 GB csv file to parquet in chunks of roughly 2GB each. 
To get > the source data: > {code:bash} > wget > https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip > unzip gaz2016zcta5distancemiles.csv.zip{code} > Then the basic idea from the [pyarrow Parquet > documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] > is instantiating the writer class; looping over chunks of the csv and > writing them to parquet; then closing the writer object. > > {code:python} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > from pathlib import Path > zcta_file = Path('gaz2016zcta5distancemiles.csv') > itr = pd.read_csv( > zcta_file, > header=0, > dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64}, > engine='c', > chunksize=64617153) > schema = pa.schema([ > pa.field('zip1', pa.string()), > pa.field('zip2', pa.string()), > pa.field('mi_to_zcta5', pa.float64())]) > writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema) > print(f'Starting conversion') > i = 0 > for df in itr: > i += 1 > print(f'Finished reading csv block {i}') > table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3) > writer.write_table(table) > print(f'Finished writing parquet block {i}') > writer.close() > {code} > Then running this python script produces the file > {code:java} > gaz2016zcta5distancemiles.parquet{code} > , but just attempting to read the metadata with `pq.ParquetFile()` produces > the above exception. > I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would > complain on import of the csv if the columns in the data were not `string`, > `string`, and `float64`, so I think creating the Parquet schema in that way > should be fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Barron closed ARROW-2372. -- Resolution: Fixed > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 > > > I get an ArrowIOError when reading a specific file that was also written by > pyarrow. Specifically, the traceback is: > {code:python} > >>> import pyarrow.parquet as pq > >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > --- > ArrowIOError Traceback (most recent call last) > in () > > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in > _init_(self, source, metadata, common_metadata) > 62 self.reader = ParquetReader() > 63 source = _ensure_file(source) > ---> 64 self.reader.open(source, metadata=metadata) > 65 self.common_metadata = common_metadata > 66 self._nested_paths_by_prefix = self._build_nested_paths() > _parquet.pyx in pyarrow._parquet.ParquetReader.open() > error.pxi in pyarrow.lib.check_status() > ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument > {code} > Here's a reproducible example with the specific file I'm working with. I'm > converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get > the source data: > {code:bash} > wget > https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip > unzip gaz2016zcta5distancemiles.csv.zip{code} > Then the basic idea from the [pyarrow Parquet > documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] > is instantiating the writer class; looping over chunks of the csv and > writing them to parquet; then closing the writer object. 
> > {code:python} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > from pathlib import Path > zcta_file = Path('gaz2016zcta5distancemiles.csv') > itr = pd.read_csv( > zcta_file, > header=0, > dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64}, > engine='c', > chunksize=64617153) > schema = pa.schema([ > pa.field('zip1', pa.string()), > pa.field('zip2', pa.string()), > pa.field('mi_to_zcta5', pa.float64())]) > writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema) > print(f'Starting conversion') > i = 0 > for df in itr: > i += 1 > print(f'Finished reading csv block {i}') > table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3) > writer.write_table(table) > print(f'Finished writing parquet block {i}') > writer.close() > {code} > Then running this python script produces the file > {code:java} > gaz2016zcta5distancemiles.parquet{code} > , but just attempting to read the metadata with `pq.ParquetFile()` produces > the above exception. > I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would > complain on import of the csv if the columns in the data were not `string`, > `string`, and `float64`, so I think creating the Parquet schema in that way > should be fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
[ https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439892#comment-16439892 ] Joshua Storck commented on ARROW-2429: -- If you invoke the write_table function as follows, the type will not change: {code:python} pq.write_table(table, 'foo.parquet', use_deprecated_int96_timestamps=True) {code} > [Python] Timestamp unit in schema changes when writing to Parquet file then > reading back > > > Key: ARROW-2429 > URL: https://issues.apache.org/jira/browse/ARROW-2429 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > PyArrow 0.9.0 (py36_1) > Python >Reporter: Dave Challis >Priority: Minor > > When creating an Arrow table from a Pandas DataFrame, the table schema > contains a field of type `timestamp[ns]`. > When serialising that table to a parquet file and then immediately reading it > back, the schema of the table read instead contains a field with type > `timestamp[us]`. > Minimal example: > > {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parquet') > print(table.schema[0]) # pyarrow.Field (nanosecond > units) > print(table2.schema[0]) # pyarrow.Field (microsecond > units) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439912#comment-16439912 ] ASF GitHub Bot commented on ARROW-2101: --- BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381719872 It looks like you need to be given rights to have issues assigned, and I guess I'm not able to do that. @pitrou or @xhochy , would you mind doing this? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow > data of binary type, even if the user supplies type information. conversion > of 'unicode' type works to create Arrow data of string types. For example > {code} > In [25]: pa.Array.from_pandas(pd.Series(['a'])).type > Out[25]: DataType(binary) > In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type > Out[26]: DataType(binary) > In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type > Out[27]: DataType(string) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439914#comment-16439914 ] ASF GitHub Bot commented on ARROW-2101: --- pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381720123 I'm not able to do it either, but I think @xhochy is :-) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow > data of binary type, even if the user supplies type information. conversion > of 'unicode' type works to create Arrow data of string types. For example > {code} > In [25]: pa.Array.from_pandas(pd.Series(['a'])).type > Out[25]: DataType(binary) > In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type > Out[26]: DataType(binary) > In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type > Out[27]: DataType(string) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2393) [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK
[ https://issues.apache.org/jira/browse/ARROW-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439950#comment-16439950 ] Joshua Storck commented on ARROW-2393: -- I don't think the ARROW_CHECK_OK and ARROW_CHECK_OK_PREPEND macros should be in status.h. They use the logging facilities and should probably be in logging.h, which shouldn't be visible. The interesting thing is that the RETURN_NOT_OK macros don't work outside of the arrow namespace. I think they need to be updated to use ::arrow::Status in their bodies. [~wesmckinn], [~pitrou], or [~cpcloud], does that make sense? If so, I'll submit a PR. > [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK > -- > > Key: ARROW-2393 > URL: https://issues.apache.org/jira/browse/ARROW-2393 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: dennis lucero >Priority: Trivial > > test.cpp > {code:c++} > #include > int main(void) { > arrow::Int64Builder i64builder; > std::shared_ptr i64array; > ARROW_CHECK_OK(i64builder.Finish(&i64array)); > return EXIT_SUCCESS; > } > {code} > Attempt to build: > {code:bash} > $CXX test.cpp -std=c++11 -larrow > {code} > Error: > {code} > test.cpp:6:2: error: use of undeclared identifier 'ARROW_CHECK' > ARROW_CHECK_OK(i64builder.Finish(&i64array)); ^ > xxx/include/arrow/status.h:49:27: note: expanded from macro 'ARROW_CHECK_OK' > #define ARROW_CHECK_OK(s) ARROW_CHECK_OK_PREPEND(s, "Bad status") ^ > xxx/include/arrow/status.h:44:5: note: expanded from macro > 'ARROW_CHECK_OK_PREPEND' ARROW_CHECK(_s.ok()) << (msg) << ": " << > _s.ToString(); \ ^ 1 error generated. > {code} > I expect that ARROW_* macro are public API, and should work out of the box. 
> A naive attempt to fix it > {code} > diff --git a/cpp/src/arrow/status.h b/cpp/src/arrow/status.h > index 84f55e41..6da4a773 100644 > --- a/cpp/src/arrow/status.h > +++ b/cpp/src/arrow/status.h > @@ -25,6 +25,7 @@ > #include "arrow/util/macros.h" > #include "arrow/util/visibility.h" > +#include "arrow/util/logging.h" > // Return the given status if it is not OK. > #define ARROW_RETURN_NOT_OK(s) \ > {code} > fails with > {code} > public-api-test.cc:21:2: error: "DCHECK should not be visible from Arrow > public headers." > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2393) [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK
[ https://issues.apache.org/jira/browse/ARROW-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439953#comment-16439953 ] Phillip Cloud commented on ARROW-2393: -- That sounds right to me. > [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK > -- > > Key: ARROW-2393 > URL: https://issues.apache.org/jira/browse/ARROW-2393 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: dennis lucero >Priority: Trivial > > test.cpp > {code:c++} > #include > int main(void) { > arrow::Int64Builder i64builder; > std::shared_ptr i64array; > ARROW_CHECK_OK(i64builder.Finish(&i64array)); > return EXIT_SUCCESS; > } > {code} > Attempt to build: > {code:bash} > $CXX test.cpp -std=c++11 -larrow > {code} > Error: > {code} > test.cpp:6:2: error: use of undeclared identifier 'ARROW_CHECK' > ARROW_CHECK_OK(i64builder.Finish(&i64array)); ^ > xxx/include/arrow/status.h:49:27: note: expanded from macro 'ARROW_CHECK_OK' > #define ARROW_CHECK_OK(s) ARROW_CHECK_OK_PREPEND(s, "Bad status") ^ > xxx/include/arrow/status.h:44:5: note: expanded from macro > 'ARROW_CHECK_OK_PREPEND' ARROW_CHECK(_s.ok()) << (msg) << ": " << > _s.ToString(); \ ^ 1 error generated. > {code} > I expect that ARROW_* macro are public API, and should work out of the box. > A naive attempt to fix it > {code} > diff --git a/cpp/src/arrow/status.h b/cpp/src/arrow/status.h > index 84f55e41..6da4a773 100644 > --- a/cpp/src/arrow/status.h > +++ b/cpp/src/arrow/status.h > @@ -25,6 +25,7 @@ > #include "arrow/util/macros.h" > #include "arrow/util/visibility.h" > +#include "arrow/util/logging.h" > // Return the given status if it is not OK. > #define ARROW_RETURN_NOT_OK(s) \ > {code} > fails with > {code} > public-api-test.cc:21:2: error: "DCHECK should not be visible from Arrow > public headers." > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)