[jira] [Commented] (ARROW-5337) [C++] Add RecordBatch::field method, possibly deprecate "column"
[ https://issues.apache.org/jira/browse/ARROW-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913991#comment-16913991 ]

Joris Van den Bossche commented on ARROW-5337:
----------------------------------------------

Since there is also an {{arrow::Field}}, which is the combination of name, type and optional metadata (and not the actual array), I think using {{column()}} is better. If there were a {{RecordBatch::field()}}, it should probably return a {{Field}} and not an array.

On the Python side this terminology is also used: {{Table.column()}} returns the (chunked) array, and {{Table.field()}} returns a field (as a shortcut for {{Table.schema.field()}}). So I would personally close this issue.

> [C++] Add RecordBatch::field method, possibly deprecate "column"
> -----------------------------------------------------------------
>
>                 Key: ARROW-5337
>                 URL: https://issues.apache.org/jira/browse/ARROW-5337
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.15.0
>
> As a matter of consistency, it might be better to rename
> {{RecordBatch::column}} to {{RecordBatch::field}}
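[Editorial note: a minimal pyarrow sketch of the column/field distinction described in the comment above. Illustrative only; behavior as in recent pyarrow releases, not code from the ticket.]

{code:python}
import pyarrow as pa

table = pa.Table.from_arrays([pa.array([1, 2, 3])], ['a'])

col = table.column(0)  # the data: a (chunked) array
fld = table.field(0)   # the metadata: a Field with name, type, nullability

print(type(col))  # a ChunkedArray, not a Field
print(fld)        # pyarrow.Field<a: int64>
{code}

Under this convention, a hypothetical {{RecordBatch::field(i)}} in C++ would return {{schema()->field(i)}} rather than the column data.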
[jira] [Commented] (ARROW-5630) [Python] Table of nested arrays doesn't round trip
[ https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913988#comment-16913988 ]

Joris Van den Bossche commented on ARROW-5630:
----------------------------------------------

Yes, I get the same error on latest master.

> [Python] Table of nested arrays doesn't round trip
> --------------------------------------------------
>
>                 Key: ARROW-5630
>                 URL: https://issues.apache.org/jira/browse/ARROW-5630
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: pyarrow 0.13, Windows 10
>            Reporter: Philip Felton
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.15.0
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def make_table(num_rows):
>     typ = pa.list_(pa.field("item", pa.float32(), False))
>     return pa.Table.from_arrays([
>         pa.array([[0] * (i % 10) for i in range(0, num_rows)], type=typ),
>         pa.array([[0] * ((i + 5) % 10) for i in range(0, num_rows)], type=typ)
>     ], ['a', 'b'])
>
> pq.write_table(make_table(100), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input> in <module>
> ----> 1 pq.read_table('full.parquet')
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source,
> columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>    1150         return fs.read_parquet(path, columns=columns,
>    1151                                use_threads=use_threads, metadata=metadata,
> -> 1152                                use_pandas_metadata=use_pandas_metadata)
>    1153
>    1154     pf = ParquetFile(source, metadata=metadata)
>
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self,
> path, columns, metadata, schema, use_threads, use_pandas_metadata)
>     179                                  filesystem=self)
>     180         return dataset.read(columns=columns, use_threads=use_threads,
> --> 181                             use_pandas_metadata=use_pandas_metadata)
>     182
>     183     def open(self, path, mode='rb'):
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, use_pandas_metadata)
>    1012             table = piece.read(columns=columns, use_threads=use_threads,
>    1013                                partitions=self.partitions,
> -> 1014                                use_pandas_metadata=use_pandas_metadata)
>    1015             tables.append(table)
>    1016
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, partitions, open_file_func, file, use_pandas_metadata)
>     562             table = reader.read_row_group(self.row_group, **options)
>     563         else:
> --> 564             table = reader.read(**options)
>     565
>     566         if len(self.partition_keys) > 0:
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, use_pandas_metadata)
>     212                 columns, use_pandas_metadata=use_pandas_metadata)
>     213         return self.reader.read_all(column_indices=column_indices,
> --> 214                                     use_threads=use_threads)
>     215
>     216     def scan_contents(self, columns=None, batch_size=65536):
>
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in
> pyarrow._parquet.ParquetReader.read_all()
>
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
>
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}
[jira] [Updated] (ARROW-6144) [C++][Gandiva] Implement random function in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prudhvi Porandla updated ARROW-6144:
------------------------------------

    Description:
Implement random() and random(int seed) functions. The values are sampled from a uniform distribution on (0, 1). The random values for each row of a column are generated from the same generator, which is initialised at (function) build time.

  (was: Implement random(), random(int seed) functions)

> [C++][Gandiva] Implement random function in Gandiva
> ----------------------------------------------------
>
>                 Key: ARROW-6144
>                 URL: https://issues.apache.org/jira/browse/ARROW-6144
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++ - Gandiva
>            Reporter: Prudhvi Porandla
>            Assignee: Prudhvi Porandla
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Implement random() and random(int seed) functions. The values are sampled
> from a uniform distribution on (0, 1). The random values for each row of a
> column are generated from the same generator, which is initialised at
> (function) build time.
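[Editorial note: a rough Python sketch of the semantics described in the updated description. This is illustrative only, not Gandiva code; all names here are hypothetical.]

{code:python}
import random

class BuiltRandomFunction:
    """Stands in for the generator a compiled expression would hold."""
    def __init__(self, seed=None):
        # One generator per function instance, initialised at build time;
        # an explicit seed makes the per-row sequence reproducible.
        self._gen = random.Random(seed)

    def evaluate_row(self):
        # Called once per row; uniform on [0, 1).
        return self._gen.random()

fn = BuiltRandomFunction(seed=42)
column = [fn.evaluate_row() for _ in range(5)]  # one value per row
{code}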
[jira] [Created] (ARROW-6331) [Java] Incorporate ErrorProne into the java build
Micah Kornfield created ARROW-6331:
---------------------------------------

             Summary: [Java] Incorporate ErrorProne into the java build
                 Key: ARROW-6331
                 URL: https://issues.apache.org/jira/browse/ARROW-6331
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration, Java
            Reporter: Micah Kornfield

Using [https://github.com/google/error-prone] seems like it would be a good idea to automatically catch more errors.
[jira] [Assigned] (ARROW-6330) [C++] Include missing headers in api.h
[ https://issues.apache.org/jira/browse/ARROW-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield reassigned ARROW-6330:
--------------------------------------

    Assignee: Micah Kornfield

> [C++] Include missing headers in api.h
> ---------------------------------------
>
>                 Key: ARROW-6330
>                 URL: https://issues.apache.org/jira/browse/ARROW-6330
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Minor
>
> I think result.h and array/concatenate.h should be included as they export
> public symbols.
[jira] [Updated] (ARROW-6330) [C++] Include missing headers in api.h
[ https://issues.apache.org/jira/browse/ARROW-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6330:
----------------------------------

    Labels: pull-request-available  (was: )

> [C++] Include missing headers in api.h
> ---------------------------------------
>
>                 Key: ARROW-6330
>                 URL: https://issues.apache.org/jira/browse/ARROW-6330
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Minor
>              Labels: pull-request-available
>
> I think result.h and array/concatenate.h should be included as they export
> public symbols.
[jira] [Created] (ARROW-6330) [C++] Include missing headers in api.h
Micah Kornfield created ARROW-6330:
---------------------------------------

             Summary: [C++] Include missing headers in api.h
                 Key: ARROW-6330
                 URL: https://issues.apache.org/jira/browse/ARROW-6330
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Micah Kornfield

I think result.h and array/concatenate.h should be included as they export public symbols.
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913863#comment-16913863 ]

Micah Kornfield commented on ARROW-5691:
----------------------------------------

Given the current organization of the code base, and based on [~xhochy]'s comment above, I think we should put the core logic of reading files under the adapters folder (where ORC is currently located), and then consume that from datasets. I don't have a good enough mental model of the current .so dependencies to offer a meaningful opinion on that.

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> -------------------------------------------------------------------
>
>                 Key: ARROW-5691
>                 URL: https://issues.apache.org/jira/browse/ARROW-5691
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
> I think it may make sense to continue developing and maintaining this code in
> the same place as other file format <-> Arrow serialization code and dataset
> handling routines (e.g. schema normalization). Under this scheme, libparquet
> becomes a link-time dependency of libarrow_dataset
[jira] [Commented] (ARROW-2303) [C++] Disable ASAN when building io-hdfs-test.cc
[ https://issues.apache.org/jira/browse/ARROW-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913851#comment-16913851 ]

Wes McKinney commented on ARROW-2303:
-------------------------------------

That's all that's necessary, IIRC. I removed this from 0.15.0 as it isn't really harming anything at the moment (other than being a bit unsightly).

> [C++] Disable ASAN when building io-hdfs-test.cc
> -------------------------------------------------
>
>                 Key: ARROW-2303
>                 URL: https://issues.apache.org/jira/browse/ARROW-2303
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>
> ASAN reports spurious memory leaks in this unit test module. I am not sure of
> the easiest way to conditionally scrub the ASAN flags from such a unit test's
> compilation flags
[jira] [Updated] (ARROW-2303) [C++] Disable ASAN when building io-hdfs-test.cc
[ https://issues.apache.org/jira/browse/ARROW-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2303:
--------------------------------

    Fix Version/s:     (was: 0.15.0)

> [C++] Disable ASAN when building io-hdfs-test.cc
> -------------------------------------------------
>
>                 Key: ARROW-2303
>                 URL: https://issues.apache.org/jira/browse/ARROW-2303
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>
> ASAN reports spurious memory leaks in this unit test module. I am not sure of
> the easiest way to conditionally scrub the ASAN flags from such a unit test's
> compilation flags
[jira] [Commented] (ARROW-6256) [Rust] parquet-format should be released by Apache process
[ https://issues.apache.org/jira/browse/ARROW-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913848#comment-16913848 ]

Wes McKinney commented on ARROW-6256:
-------------------------------------

Unclear if this is resolvable in time for Arrow 0.15.0. You need to have a discussion on either the Parquet or Arrow mailing lists about how to proceed.

> [Rust] parquet-format should be released by Apache process
> -----------------------------------------------------------
>
>                 Key: ARROW-6256
>                 URL: https://issues.apache.org/jira/browse/ARROW-6256
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>    Affects Versions: 0.14.1
>            Reporter: Andy Grove
>            Priority: Major
>             Fix For: 0.15.0
>
> The Arrow parquet crate depends on the parquet-format crate [1].
> Parquet-format 2.6.0 was recently released and has breaking changes compared
> to 2.5.0.
> This means that previously published Arrow Parquet/DataFusion crates are now
> unusable out of the box [2].
> We should bring parquet-format into an Apache release process to avoid this
> type of issue in the future.
>
> [1] [https://github.com/sunchao/parquet-format-rs]
> [2] https://issues.apache.org/jira/browse/ARROW-6255
[jira] [Updated] (ARROW-5141) [C++] Share more of the IPC testing utils with the rest of Arrow
[ https://issues.apache.org/jira/browse/ARROW-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5141:
--------------------------------

    Fix Version/s:     (was: 0.15.0)
                   1.0.0

> [C++] Share more of the IPC testing utils with the rest of Arrow
> -----------------------------------------------------------------
>
>                 Key: ARROW-5141
>                 URL: https://issues.apache.org/jira/browse/ARROW-5141
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 0.13.0
>            Reporter: Antoine Pitrou
>            Priority: Minor
>             Fix For: 1.0.0
>
> Some APIs in {{arrow/ipc/test-common.h}} aren't really IPC-specific.
> Furthermore, {{arrow/ipc/test-common.h}} is already included in non-IPC
> tests. Those APIs should be moved to the Arrow-wide testing utilities.
[jira] [Updated] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6301:
----------------------------------

    Labels: pull-request-available  (was: )

> [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name
> arrow.py_extension_type found'
> --------------------------------------------------------------------------
>
>                 Key: ARROW-6301
>                 URL: https://issues.apache.org/jira/browse/ARROW-6301
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>         Environment: linux, virtualenv, uwsgi, cpython 2.7
>            Reporter: David Alphus
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
> On interrupt, I am frequently seeing the atexit function failing in pyarrow
> 0.14.1.
> {code:java}
> ^CSIGINT/SIGQUIT received...killing workers...
> killing the spooler with pid 22640
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "pyarrow/types.pxi", line 1860, in pyarrow.lib._unregister_py_extension_type
>     check_status(UnregisterPyExtensionType())
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
>     raise ArrowKeyError(message)
> ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> Error in sys.exitfunc:
> Traceback (most recent call last):
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "pyarrow/types.pxi", line 1860, in pyarrow.lib._unregister_py_extension_type
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
> pyarrow.lib.ArrowKeyError: 'No type extension with name
> arrow.py_extension_type found'
> spooler (pid: 22640) annihilated
> worker 1 buried after 1 seconds
> goodbye to uWSGI.{code}
[jira] [Updated] (ARROW-412) [Format] Handling of buffer padding in the IPC metadata
[ https://issues.apache.org/jira/browse/ARROW-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-412:
-------------------------------

    Fix Version/s:     (was: 0.15.0)
                   1.0.0

> [Format] Handling of buffer padding in the IPC metadata
> --------------------------------------------------------
>
>                 Key: ARROW-412
>                 URL: https://issues.apache.org/jira/browse/ARROW-412
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Format
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> See discussion in ARROW-399. Do we include padding bytes in the metadata or
> set the actual used bytes? In the latter case, the padding would be a part of
> the format (any buffers continue to be expected to be 64-byte padded, to
> permit AVX512 instructions)
[jira] [Comment Edited] (ARROW-5508) [C++] Create reusable Iterator interface
[ https://issues.apache.org/jira/browse/ARROW-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913838#comment-16913838 ]

Francois Saint-Jacques edited comment on ARROW-5508 at 8/23/19 1:17 AM:
------------------------------------------------------------------------

My take after implementing MapIterator, FlattenIterator and using it heavily in the dataset code.

# `T` must be of pointer type (or support assignment/comparison of nullptr). Iterator completion is signaled by assigning `T* out` to nullptr.
# Due to the previous point, the iterator may never yield nullptr as a valid value.
# The interface forces consuming a value to know if it's empty, i.e. there's no Done()/HasNext(). This can lead to [odd|https://github.com/fsaintjacques/arrow/commit/36ba801f47a1053c292fd461afd4ec23e63c1e97#diff-df9646433131d9cf9f31a395c2719b70R157-R190] consumption.
# I question the use of Status as a return code; maybe we should have a specialized `FailableIterator<T> : Iterator<Result<T>>` for the same effect.

The first and second points could be tackled by returning `Option<T>` (`Result<T>` wouldn't work because we can't use Status::OK() as a sentinel-completion value). The third is annoying for streaming iterators (when there's no way to know completion without a side effect), since the iterator itself must consume on the Done() call and cache the result. I think I prefer putting the onus on the iterator implementor rather than on the user of the interface.

was (Author: fsaintjacques):
My take after implementing MapIterator, FlattenIterator.

# `T` must be of pointer type (or support assignment/comparison of nullptr). Iterator completion is signaled by assigning `T* out` to nullptr.
# Due to the previous point, the iterator may never yield nullptr as a valid value.
# The interface forces consuming a value to know if it's empty, i.e. there's no Done()/HasNext(). This can lead to [odd|https://github.com/fsaintjacques/arrow/commit/36ba801f47a1053c292fd461afd4ec23e63c1e97#diff-df9646433131d9cf9f31a395c2719b70R157-R190] consumption.
# I question the use of Status as a return code; maybe we should have a specialized `FailableIterator<T> : Iterator<Result<T>>` for the same effect.

The first and second points could be tackled by returning `Option<T>` (`Result<T>` wouldn't work because we can't use Status::OK() as a sentinel-completion value). The third is annoying for streaming iterators (when there's no way to know completion without a side effect), since the iterator itself must consume on the Done() call and cache the result. I think I prefer putting the onus on the iterator implementor rather than on the user of the interface.

> [C++] Create reusable Iterator interface
> -----------------------------------------
>
>                 Key: ARROW-5508
>                 URL: https://issues.apache.org/jira/browse/ARROW-5508
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.15.0
>
> We have various iterator-like classes. I envision a reusable interface like
> {code}
> template <typename T>
> class Iterator {
>  public:
>   virtual ~Iterator() = default;
>   virtual Status Next(T* out) = 0;
> };
> {code}
[jira] [Commented] (ARROW-5508) [C++] Create reusable Iterator interface
[ https://issues.apache.org/jira/browse/ARROW-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913838#comment-16913838 ]

Francois Saint-Jacques commented on ARROW-5508:
-----------------------------------------------

My take after implementing MapIterator, FlattenIterator.

# `T` must be of pointer type (or support assignment/comparison of nullptr). Iterator completion is signaled by assigning `T* out` to nullptr.
# Due to the previous point, the iterator may never yield nullptr as a valid value.
# The interface forces consuming a value to know if it's empty, i.e. there's no Done()/HasNext(). This can lead to [odd|https://github.com/fsaintjacques/arrow/commit/36ba801f47a1053c292fd461afd4ec23e63c1e97#diff-df9646433131d9cf9f31a395c2719b70R157-R190] consumption.
# I question the use of Status as a return code; maybe we should have a specialized `FailableIterator<T> : Iterator<Result<T>>` for the same effect.

The first and second points could be tackled by returning `Option<T>` (`Result<T>` wouldn't work because we can't use Status::OK() as a sentinel-completion value). The third is annoying for streaming iterators (when there's no way to know completion without a side effect), since the iterator itself must consume on the Done() call and cache the result. I think I prefer putting the onus on the iterator implementor rather than on the user of the interface.

> [C++] Create reusable Iterator interface
> -----------------------------------------
>
>                 Key: ARROW-5508
>                 URL: https://issues.apache.org/jira/browse/ARROW-5508
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.15.0
>
> We have various iterator-like classes. I envision a reusable interface like
> {code}
> template <typename T>
> class Iterator {
>  public:
>   virtual ~Iterator() = default;
>   virtual Status Next(T* out) = 0;
> };
> {code}
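[Editorial note: a loose Python analogy of the `Option<T>`-returning design discussed above, illustrative only; the actual proposal is a C++ interface, and the class and method names here are hypothetical.]

{code:python}
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

class OptionIterator(Generic[T]):
    """Signals completion out-of-band instead of reserving a sentinel value."""
    def __init__(self, values):
        self._it = iter(values)

    def next(self) -> Optional[T]:
        # None is returned only on exhaustion. A real Option<T> would also
        # let null-like values flow through as data, which this Python
        # analogy cannot fully capture.
        try:
            return next(self._it)
        except StopIteration:
            return None

it = OptionIterator([1, 2, 3])
while (v := it.next()) is not None:
    print(v)
{code}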
[jira] [Assigned] (ARROW-6328) Click.option-s should have help text
[ https://issues.apache.org/jira/browse/ARROW-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sutou Kouhei reassigned ARROW-6328:
-----------------------------------

    Assignee: Ulzii O

> Click.option-s should have help text
> -------------------------------------
>
>                 Key: ARROW-6328
>                 URL: https://issues.apache.org/jira/browse/ARROW-6328
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ulzii O
>            Assignee: Ulzii O
>            Priority: Trivial
>
> Click.option-s should have `help` text
>
> ## What?
> Add `help` text to click.option
>
> ## Why?
> Click.option should ideally have a `help` text defined to be useful.
[jira] [Resolved] (ARROW-3531) [Python] Deprecate Schema.field_by_name in favor of __getitem__
[ https://issues.apache.org/jira/browse/ARROW-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-3531.
---------------------------------

    Fix Version/s: 0.15.0
       Resolution: Fixed

Issue resolved by pull request 5165
[https://github.com/apache/arrow/pull/5165]

> [Python] Deprecate Schema.field_by_name in favor of __getitem__
> ----------------------------------------------------------------
>
>                 Key: ARROW-3531
>                 URL: https://issues.apache.org/jira/browse/ARROW-3531
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Krisztian Szucs
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Similar to https://github.com/apache/arrow/pull/2754
[jira] [Assigned] (ARROW-3531) [Python] Deprecate Schema.field_by_name in favor of __getitem__
[ https://issues.apache.org/jira/browse/ARROW-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-3531:
-----------------------------------

    Assignee: Joris Van den Bossche

> [Python] Deprecate Schema.field_by_name in favor of __getitem__
> ----------------------------------------------------------------
>
>                 Key: ARROW-3531
>                 URL: https://issues.apache.org/jira/browse/ARROW-3531
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Krisztian Szucs
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Similar to https://github.com/apache/arrow/pull/2754
[jira] [Resolved] (ARROW-6178) [Developer] Don't fail in merge script on bad primary author input in multi-author PRs
[ https://issues.apache.org/jira/browse/ARROW-6178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-6178.
---------------------------------

    Resolution: Fixed

Issue resolved by pull request 5149
[https://github.com/apache/arrow/pull/5149]

> [Developer] Don't fail in merge script on bad primary author input in
> multi-author PRs
> ----------------------------------------------------------------------
>
>                 Key: ARROW-6178
>                 URL: https://issues.apache.org/jira/browse/ARROW-6178
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Developer Tools
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> I was going on autopilot in a multi-author PR and this happened
> {code}
> Switched to branch 'PR_TOOL_MERGE_PR_5000_MASTER'
> Automatic merge went well; stopped before committing as requested
> Author 1: François Saint-Jacques
> Author 2: Wes McKinney
> Enter primary author in the format of "name <email>" [François Saint-Jacques]: y
> fatal: --author '"y"' is not 'Name <email>' and matches no existing author
> Command failed: ['git', 'commit', '--no-verify', '--author="y"', '-m',
> 'ARROW-6121: [Tools] Improve merge tool ergonomics', '-m', '-
> merge_arrow_pr.py now accepts the pull-request number as a single optional
> argument, e.g. `./merge_arrow_pr.py 4921`.\r\n- merge_arrow_pr.py can
> optionally read a configuration file located in
> `~/.config/arrow/merge.conf` which contains options like jira credentials.
> See the `dev/merge.conf` file as example', '-m', 'Closes #5000 from
> fsaintjacques/ARROW-6121-merge-ergonomic and squashes the following
> commits:', '-m', '5298308d7 <Francois Saint-Jacques> Handle username/password
> separately (in case username is set but not password)\n581653735 <Francois
> Saint-Jacques> Rename merge.conf to merge.conf.sample\n7c51ca8f0 <Francois
> Saint-Jacques> Add license to config file\n1213946bd <Francois Saint-Jacques>
> ARROW-6121: Improve merge tool ergonomics', '-m', 'Lead-authored-by:
> y\nCo-authored-by: François Saint-Jacques\nCo-authored-by: Wes
> McKinney\nSigned-off-by: Wes McKinney']
> With output:
> --------------
> b''
> --------------
> Traceback (most recent call last):
>   File "dev/merge_arrow_pr.py", line 530, in <module>
>     if pr.is_merged:
>   File "dev/merge_arrow_pr.py", line 515, in cli
>     PROJECT_NAME = os.environ.get('ARROW_PROJECT_NAME') or 'arrow'
>   File "dev/merge_arrow_pr.py", line 420, in merge
>     '--author="%s"' % primary_author] +
>   File "dev/merge_arrow_pr.py", line 89, in run_cmd
>     print('--')
>   File "dev/merge_arrow_pr.py", line 81, in run_cmd
>     try:
>   File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/subprocess.py", line 395, in check_output
>     **kwargs).stdout
>   File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/subprocess.py", line 487, in run
>     output=stdout, stderr=stderr)
> {code}
> If the input does not match the expected format, we should loop to request
> input again rather than failing out (which requires messy manual cleanup of
> temporary branches)
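[Editorial note: a sketch of the re-prompting behavior proposed in the issue. This is a hypothetical helper, not the actual merge_arrow_pr.py code: keep asking until the input matches 'Name <email>'.]

{code:python}
import re

# Loose check for 'Name <email>'; the real script's validation may differ.
AUTHOR_RE = re.compile(r'^.+ <[^<>]+@[^<>]+>$')

def prompt_primary_author(default_author):
    while True:
        value = input('Enter primary author in the format of '
                      '"name <email>" [%s]: ' % default_author).strip()
        if not value:
            return default_author  # accept the default on empty input
        if AUTHOR_RE.match(value):
            return value
        # Bad input: loop and re-prompt instead of failing out, which
        # would require messy manual cleanup of temporary branches.
        print('Bad author "%s", please try again' % value)
{code}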
[jira] [Resolved] (ARROW-6329) [Format] Add 4-byte "stream continuation" to IPC message format to align Flatbuffers
[ https://issues.apache.org/jira/browse/ARROW-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-6329.
---------------------------------

    Resolution: Fixed

Issue resolved by pull request 4951
[https://github.com/apache/arrow/pull/4951]

> [Format] Add 4-byte "stream continuation" to IPC message format to align
> Flatbuffers
> -------------------------------------------------------------------------
>
>                 Key: ARROW-6329
>                 URL: https://issues.apache.org/jira/browse/ARROW-6329
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>            Reporter: Wes McKinney
>            Assignee: Micah Kornfield
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
> This is the JIRA corresponding to the mailing list discussion
[jira] [Updated] (ARROW-6329) [Format] Add 4-byte "stream continuation" to IPC message format to align Flatbuffers
[ https://issues.apache.org/jira/browse/ARROW-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6329:
----------------------------------

    Labels: pull-request-available  (was: )

> [Format] Add 4-byte "stream continuation" to IPC message format to align
> Flatbuffers
> -------------------------------------------------------------------------
>
>                 Key: ARROW-6329
>                 URL: https://issues.apache.org/jira/browse/ARROW-6329
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>            Reporter: Wes McKinney
>            Assignee: Micah Kornfield
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
> This is the JIRA corresponding to the mailing list discussion
[jira] [Created] (ARROW-6329) [Format] Add 4-byte "stream continuation" to IPC message format to align Flatbuffers
Wes McKinney created ARROW-6329:
---------------------------------------

             Summary: [Format] Add 4-byte "stream continuation" to IPC message format to align Flatbuffers
                 Key: ARROW-6329
                 URL: https://issues.apache.org/jira/browse/ARROW-6329
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Format
            Reporter: Wes McKinney
            Assignee: Micah Kornfield
             Fix For: 0.15.0

This is the JIRA corresponding to the mailing list discussion
[jira] [Updated] (ARROW-6127) [Website] Refresh website theme
[ https://issues.apache.org/jira/browse/ARROW-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-6127:
-----------------------------------

    Fix Version/s: 1.0.0

> [Website] Refresh website theme
> --------------------------------
>
>                 Key: ARROW-6127
>                 URL: https://issues.apache.org/jira/browse/ARROW-6127
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Website
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>             Fix For: 1.0.0
>
> Among the things I noticed recently that should be easy to clean up:
> * We should supply a favicon
> * The <title> is the same for every page and it always says "Apache Arrow
> Homepage"
> * There are no opengraph or twitter card meta tags, so there's no link
> preview
> * The version of bootstrap used is not current and has been flagged as a
> possible security vulnerability
>
> Much of this could just be fixed by porting to a modern Hugo template, which
> I'll explore.
[jira] [Updated] (ARROW-5505) [R] Stop masking base R functions/rethink namespacing
[ https://issues.apache.org/jira/browse/ARROW-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-5505:
-----------------------------------

    Fix Version/s:     (was: 1.0.0)
                   0.15.0

> [R] Stop masking base R functions/rethink namespacing
> -------------------------------------------------------
>
>                 Key: ARROW-5505
>                 URL: https://issues.apache.org/jira/browse/ARROW-5505
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The package startup message about masking base functions can be scary. We
> should avoid masking base functions without a compelling reason (i.e. let's
> do arrow_array() instead of array(), arrow_table()). The arrow versions do
> very different things than the base functions; plus, end users shouldn't be
> dealing directly with Tables and Arrays, so they don't need to figure so
> prominently in the public API of the package.
[jira] [Updated] (ARROW-5686) [R] Review R Windows CI build
[ https://issues.apache.org/jira/browse/ARROW-5686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-5686:
-----------------------------------

    Fix Version/s: 0.15.0

> [R] Review R Windows CI build
> ------------------------------
>
>                 Key: ARROW-5686
>                 URL: https://issues.apache.org/jira/browse/ARROW-5686
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Followup to ARROW-3758 / [https://github.com/apache/arrow/pull/4622]. In
> that, I leveraged the tools in
> [https://github.com/r-windows/rtools-backports] to set up CI for Arrow C++
> and R on Windows using Appveyor. I was guided mainly by the steps described
> [here|https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-BuildingWindowspackages]
> on the Arrow project wiki and iterated until I got a passing build.
> Despite getting it to "work", I'm certain I've missed some subtleties, and
> there may be better ways to accomplish this. Some specific questions:
> * I found that I could ignore rtools-backports/ci-library.sh and most of
> ci-build.sh because it was oriented around building possibly many packages,
> but there was a block of {{pacman}} stuff I did have to copy here:
> [https://github.com/apache/arrow/pull/4622/files#diff-f4a8bedb9b0d3fe301a84914916f6d49R22].
> I'm not sure how much these are likely to change, but if that's a concern,
> maybe that setup could be factored out to a separate shell script in
> rtools-backports, and the arrow CI could {{wget}} and {{source}} it like it
> does some other resources. That way, our setup here wouldn't diverge.
> * I did not understand what I needed to do with rtools-packages, if
> anything. It seems that it's not used by R yet, so is it just important to
> have the PKGBUILD in place there for when it is ready? If I wanted to build
> both rtools-backports and rtools-packages builds in the same job, is the
> difference only [these environment
> variables|https://github.com/r-windows/rtools-backports/blob/master/mingw-w64-arrow/PKGBUILD#L48-L52]?
> * The process of taking the appveyor build artifacts, unzipping them, and
> merging them into the "rwinlib" directory layout seemed loose and poorly
> defined on the wiki, at least as far as I could tell. I packaged up the
> process (as I understood it) in a [shell
> script|https://github.com/apache/arrow/pull/4622/files#diff-c043cda9f4ed847b06efeeacf04634ee],
> and it produced a zip file that is the right shape (right enough that R
> could install the arrow R package with it and run tests). Does that script
> make sense? In particular,
> ** Is there a good way to keep around the other dependencies
> (double-conversion, boost, thrift) from when the packages are built so that I
> don't have to re-download them from bintray? I see that they get pulled down
> at the beginning of each pkgbuild and then removed after, but I don't know
> where they are put such that I could keep them around and use them later.
> ** Is the {{lib}} directory for other dependencies (e.g.
> libdouble-conversion.a) and {{lib-4.9.3}} for the arrow and parquet binaries
> we build, as the wiki says? Or is {{lib}} for the Rtools4.0/gcc8 versions and
> lib-4.9.3 for the Rtools3.5/gcc4 versions?
> ** libdouble-conversion.a only seems to exist in the rtools-packages
> Rtools4.0 packages, but that nevertheless works on the R release version.
> However, if I used the versions of boost and thrift from the Rtools4.0
> bintrays, the R package did not build (link) correctly.
>
> To be clear, it is not our intention to fork or otherwise avoid the supported
> Rtools toolchain that is maintained there; rather, we want to continuously
> integrate arrow to avoid breaking things and make it easier to submit updates
> to rtools-backports/packages/rwinlib when there's a new arrow release. We
> want as much as possible to use the supported tools and workflows and are
> willing to contribute to enhancing them, though we recognize that our needs
> (as a big C++ library under heavy active development) are probably not shared
> by many other projects that use rtools-packages et al.
[jira] [Commented] (ARROW-6260) [Website] Use deploy key on Travis to build and push to asf-site
[ https://issues.apache.org/jira/browse/ARROW-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913791#comment-16913791 ]

Neal Richardson commented on ARROW-6260:
----------------------------------------

INFRA-18924 is the ticket for setting the keys in the right places.

> [Website] Use deploy key on Travis to build and push to asf-site
> -----------------------------------------------------------------
>
>                 Key: ARROW-6260
>                 URL: https://issues.apache.org/jira/browse/ARROW-6260
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Website
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> ARROW-4473 added CI/CD for the website, but there was some discomfort about
> having a committer provide a GitHub personal access token to do the pushing
> of the built site to the asf-site branch. Investigate using GitHub Deploy
> Keys instead, which are scoped to a single repository, not all public
> repositories that a user has access to.
[jira] [Updated] (ARROW-5176) [Python] Automate formatting of python files
[ https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5176:
----------------------------------

    Labels: pull-request-available  (was: )

> [Python] Automate formatting of python files
> ---------------------------------------------
>
>                 Key: ARROW-5176
>                 URL: https://issues.apache.org/jira/browse/ARROW-5176
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Benjamin Kietzman
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
> [Black|https://github.com/ambv/black] is a tool for automatically formatting
> python code in ways which flake8 and our other linters approve of. Adding it
> to the project will allow more reliably formatted python code and fill a
> similar role to {{clang-format}} for C++ and {{cmake-format}} for CMake
[jira] [Commented] (ARROW-5176) [Python] Automate formatting of python files
[ https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913786#comment-16913786 ]

Neal Richardson commented on ARROW-5176:
----------------------------------------

I just made a PR with the blackened codebase, for consideration.

> [Python] Automate formatting of python files
> ---------------------------------------------
>
>                 Key: ARROW-5176
>                 URL: https://issues.apache.org/jira/browse/ARROW-5176
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Benjamin Kietzman
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [Black|https://github.com/ambv/black] is a tool for automatically formatting
> python code in ways which flake8 and our other linters approve of. Adding it
> to the project will allow more reliably formatted python code and fill a
> similar role to {{clang-format}} for C++ and {{cmake-format}} for CMake
[jira] [Assigned] (ARROW-5176) [Python] Automate formatting of python files
[ https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson reassigned ARROW-5176:
--------------------------------------

    Assignee: Neal Richardson

> [Python] Automate formatting of python files
> ---------------------------------------------
>
>                 Key: ARROW-5176
>                 URL: https://issues.apache.org/jira/browse/ARROW-5176
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Benjamin Kietzman
>            Assignee: Neal Richardson
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [Black|https://github.com/ambv/black] is a tool for automatically formatting
> python code in ways which flake8 and our other linters approve of. Adding it
> to the project will allow more reliably formatted python code and fill a
> similar role to {{clang-format}} for C++ and {{cmake-format}} for CMake
[jira] [Updated] (ARROW-4836) [Python] "Cannot tell() a compressed stream" when using RecordBatchStreamWriter
[ https://issues.apache.org/jira/browse/ARROW-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4836:
--------------------------------

    Summary: [Python] "Cannot tell() a compressed stream" when using RecordBatchStreamWriter  (was: "Cannot tell() a compressed stream" when using RecordBatchStreamWriter)

> [Python] "Cannot tell() a compressed stream" when using
> RecordBatchStreamWriter
> ---------------------------------------------------------
>
>                 Key: ARROW-4836
>                 URL: https://issues.apache.org/jira/browse/ARROW-4836
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.1
>            Reporter: Mike Pedersen
>            Priority: Major
>             Fix For: 0.15.0
>
> It does not seem like RecordBatchStreamWriter works with compressed streams:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.12.1'
> >>> stream = pa.output_stream('/tmp/a.gz')
> >>> batch = pa.RecordBatch.from_arrays([pa.array([1])], ['a'])
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write(batch)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/ipc.pxi", line 181, in pyarrow.lib._RecordBatchWriter.write
>   File "pyarrow/ipc.pxi", line 196, in pyarrow.lib._RecordBatchWriter.write_batch
>   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Cannot tell() a compressed stream
> {code}
> As I understand the documentation, this should be possible, right?
[jira] [Commented] (ARROW-5417) [Website] http://arrow.apache.org doesn't redirect to https
[ https://issues.apache.org/jira/browse/ARROW-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913783#comment-16913783 ]

Neal Richardson commented on ARROW-5417:
----------------------------------------

I'll close this since it's reported there and there's nada we can do about it here.

> [Website] http://arrow.apache.org doesn't redirect to https
> ------------------------------------------------------------
>
>                 Key: ARROW-5417
>                 URL: https://issues.apache.org/jira/browse/ARROW-5417
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Website
>            Reporter: Neal Richardson
>            Priority: Minor
>
> This should be a simple (for someone sufficiently authorized) config change
> somewhere.
[jira] [Closed] (ARROW-5417) [Website] http://arrow.apache.org doesn't redirect to https
[ https://issues.apache.org/jira/browse/ARROW-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson closed ARROW-5417.
----------------------------------

      Assignee: Neal Richardson
    Resolution: Invalid

> [Website] http://arrow.apache.org doesn't redirect to https
> ------------------------------------------------------------
>
>                 Key: ARROW-5417
>                 URL: https://issues.apache.org/jira/browse/ARROW-5417
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Website
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Minor
>
> This should be a simple (for someone sufficiently authorized) config change
> somewhere.
[jira] [Commented] (ARROW-5417) [Website] http://arrow.apache.org doesn't redirect to https
[ https://issues.apache.org/jira/browse/ARROW-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913779#comment-16913779 ]

Neal Richardson commented on ARROW-5417:
----------------------------------------

https://issues.apache.org/jira/browse/INFRA-18923

> [Website] http://arrow.apache.org doesn't redirect to https
> ------------------------------------------------------------
>
>                 Key: ARROW-5417
>                 URL: https://issues.apache.org/jira/browse/ARROW-5417
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Website
>            Reporter: Neal Richardson
>            Priority: Minor
>
> This should be a simple (for someone sufficiently authorized) config change
> somewhere.
[jira] [Updated] (ARROW-5703) [Python] Arrow cannot serialize complex scalars in numpy
[ https://issues.apache.org/jira/browse/ARROW-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5703:
--------------------------------

    Summary: [Python] Arrow cannot serialize complex scalars in numpy  (was: Arrow cannot serialize complex scalars in numpy)

> [Python] Arrow cannot serialize complex scalars in numpy
> ----------------------------------------------------------
>
>                 Key: ARROW-5703
>                 URL: https://issues.apache.org/jira/browse/ARROW-5703
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.1
>         Environment: Miniconda Python 3.7.3
>                      Numpy 1.16.4
>            Reporter: Wenjun Si
>            Priority: Minor
>
> When serializing complex scalars in Numpy, PyArrow reports an
> ArrowNotImplementedError. This can be reproduced with the code below
> {code:python}
> import numpy as np
> import pyarrow as pa
>
> # this raises ArrowNotImplementedError: Numpy scalar type not recognized
> pa.serialize(np.complex128(1+3j))
>
> # this runs ok
> pa.serialize(np.array([1+3j]))
> {code}
> and the stack is
> {code}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> <ipython-input> in <module>
> ----> 1 pa.serialize(np.complex128(1+3j))
>
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/serialization.pxi in
> pyarrow.lib.serialize()
>
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/error.pxi in
> pyarrow.lib.check_status()
>
> ArrowNotImplementedError: Numpy scalar type not recognized
> {code}
> When investigating into cpp/src/arrow/python/serialize.cc, we found that the
> function AppendScalar does not consider complex types.
> Related github discussions: https://github.com/mars-project/mars/issues/476
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913778#comment-16913778 ]

Wes McKinney commented on ARROW-5691:
-------------------------------------

[~emkornfi...@gmail.com] [~pitrou] thoughts about organizing our various file format interfaces (CSV, JSON, ORC, Parquet, eventually Avro)? Wherever we organize the code, we have to have it built as a shared library separate from {{libarrow.so}}, since it will need to depend on e.g. {{libparquet.so}} (which in turn depends on {{libarrow.so}})?

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> -------------------------------------------------------------------
>
>                 Key: ARROW-5691
>                 URL: https://issues.apache.org/jira/browse/ARROW-5691
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
> I think it may make sense to continue developing and maintaining this code in
> the same place as other file format <-> Arrow serialization code and dataset
> handling routines (e.g. schema normalization). Under this scheme, libparquet
> becomes a link-time dependency of libarrow_dataset
[jira] [Commented] (ARROW-5417) [Website] http://arrow.apache.org doesn't redirect to https
[ https://issues.apache.org/jira/browse/ARROW-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913776#comment-16913776 ]

Neal Richardson commented on ARROW-5417:
----------------------------------------

I'll make the INFRA ticket and see what they say.

> [Website] http://arrow.apache.org doesn't redirect to https
> ------------------------------------------------------------
>
>                 Key: ARROW-5417
>                 URL: https://issues.apache.org/jira/browse/ARROW-5417
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Website
>            Reporter: Neal Richardson
>            Priority: Minor
>
> This should be a simple (for someone sufficiently authorized) config change
> somewhere.
[jira] [Commented] (ARROW-5682) [Python] from_pandas conversion casts values to string inconsistently
[ https://issues.apache.org/jira/browse/ARROW-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913775#comment-16913775 ]

Wes McKinney commented on ARROW-5682:
-------------------------------------

This is a bit weird. I added to 0.15.0 in case we want to fix this.

> [Python] from_pandas conversion casts values to string inconsistently
> -----------------------------------------------------------------------
>
>                 Key: ARROW-5682
>                 URL: https://issues.apache.org/jira/browse/ARROW-5682
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Bryan Cutler
>            Priority: Minor
>             Fix For: 0.15.0
>
> When calling {{pa.Array.from_pandas}} with primitive data as input, and
> casting to string with "type=pa.string()", the resulting pyarrow Array can
> have inconsistent values. For most input, the result is an empty string,
> however for some types (int32, int64) the values are '\x01' etc.
> {noformat}
> In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)
>
> In [9]: pa.Array.from_pandas(s, type=pa.string())
> Out[9]:
> <pyarrow.lib.StringArray object at 0x...>
> [
>   "",
>   "",
>   ""
> ]
>
> In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)
>
> In [11]: pa.Array.from_pandas(s, type=pa.string())
> Out[11]:
> <pyarrow.lib.StringArray object at 0x...>
> [
>   "",
>   "",
>   ""
> ]
> {noformat}
> This came from the Spark discussion
> https://github.com/apache/spark/pull/24930/files#r296187903. Type casting
> this way in Spark is not supported, but it would be good to get the behavior
> consistent. Would it be better to raise an UnsupportedOperation error?
[jira] [Updated] (ARROW-5682) [Python] from_pandas conversion casts values to string inconsistently
[ https://issues.apache.org/jira/browse/ARROW-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5682:
--------------------------------

    Fix Version/s: 0.15.0

> [Python] from_pandas conversion casts values to string inconsistently
> -----------------------------------------------------------------------
>
>                 Key: ARROW-5682
>                 URL: https://issues.apache.org/jira/browse/ARROW-5682
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Bryan Cutler
>            Priority: Minor
>             Fix For: 0.15.0
>
> When calling {{pa.Array.from_pandas}} with primitive data as input, and
> casting to string with "type=pa.string()", the resulting pyarrow Array can
> have inconsistent values. For most input, the result is an empty string,
> however for some types (int32, int64) the values are '\x01' etc.
> {noformat}
> In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)
>
> In [9]: pa.Array.from_pandas(s, type=pa.string())
> Out[9]:
> <pyarrow.lib.StringArray object at 0x...>
> [
>   "",
>   "",
>   ""
> ]
>
> In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)
>
> In [11]: pa.Array.from_pandas(s, type=pa.string())
> Out[11]:
> <pyarrow.lib.StringArray object at 0x...>
> [
>   "",
>   "",
>   ""
> ]
> {noformat}
> This came from the Spark discussion
> https://github.com/apache/spark/pull/24930/files#r296187903. Type casting
> this way in Spark is not supported, but it would be good to get the behavior
> consistent. Would it be better to raise an UnsupportedOperation error?
[jira] [Updated] (ARROW-5505) [R] Stop masking base R functions/rethink namespacing
[ https://issues.apache.org/jira/browse/ARROW-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5505:
----------------------------------

    Labels: pull-request-available  (was: )

> [R] Stop masking base R functions/rethink namespacing
> -------------------------------------------------------
>
>                 Key: ARROW-5505
>                 URL: https://issues.apache.org/jira/browse/ARROW-5505
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> The package startup message about masking base functions can be scary. We
> should avoid masking base functions without a compelling reason (i.e. let's
> do arrow_array() instead of array(), arrow_table()). The arrow versions do
> very different things than the base functions; plus, end users shouldn't be
> dealing directly with Tables and Arrays, so they don't need to figure so
> prominently in the public API of the package.
[jira] [Assigned] (ARROW-5505) [R] Stop masking base R functions/rethink namespacing
[ https://issues.apache.org/jira/browse/ARROW-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson reassigned ARROW-5505:
--------------------------------------

    Assignee: Neal Richardson

> [R] Stop masking base R functions/rethink namespacing
> -------------------------------------------------------
>
>                 Key: ARROW-5505
>                 URL: https://issues.apache.org/jira/browse/ARROW-5505
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The package startup message about masking base functions can be scary. We
> should avoid masking base functions without a compelling reason (i.e. let's
> do arrow_array() instead of array(), arrow_table()). The arrow versions do
> very different things than the base functions; plus, end users shouldn't be
> dealing directly with Tables and Arrays, so they don't need to figure so
> prominently in the public API of the package.
[jira] [Commented] (ARROW-5630) [Python] Table of nested arrays doesn't round trip
[ https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913774#comment-16913774 ]

Wes McKinney commented on ARROW-5630:
-------------------------------------

Added to 0.15.0. Is this still an issue?

> [Python] Table of nested arrays doesn't round trip
> --------------------------------------------------
>
>                 Key: ARROW-5630
>                 URL: https://issues.apache.org/jira/browse/ARROW-5630
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: pyarrow 0.13, Windows 10
>            Reporter: Philip Felton
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.15.0
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def make_table(num_rows):
>     typ = pa.list_(pa.field("item", pa.float32(), False))
>     return pa.Table.from_arrays([
>         pa.array([[0] * (i % 10) for i in range(0, num_rows)], type=typ),
>         pa.array([[0] * ((i + 5) % 10) for i in range(0, num_rows)], type=typ)
>     ], ['a', 'b'])
>
> pq.write_table(make_table(100), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input> in <module>
> ----> 1 pq.read_table('full.parquet')
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source,
> columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>    1150         return fs.read_parquet(path, columns=columns,
>    1151                                use_threads=use_threads, metadata=metadata,
> -> 1152                                use_pandas_metadata=use_pandas_metadata)
>    1153
>    1154     pf = ParquetFile(source, metadata=metadata)
>
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self,
> path, columns, metadata, schema, use_threads, use_pandas_metadata)
>     179                                  filesystem=self)
>     180         return dataset.read(columns=columns, use_threads=use_threads,
> --> 181                             use_pandas_metadata=use_pandas_metadata)
>     182
>     183     def open(self, path, mode='rb'):
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, use_pandas_metadata)
>    1012             table = piece.read(columns=columns, use_threads=use_threads,
>    1013                                partitions=self.partitions,
> -> 1014                                use_pandas_metadata=use_pandas_metadata)
>    1015             tables.append(table)
>    1016
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, partitions, open_file_func, file, use_pandas_metadata)
>     562             table = reader.read_row_group(self.row_group, **options)
>     563         else:
> --> 564             table = reader.read(**options)
>     565
>     566         if len(self.partition_keys) > 0:
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, use_pandas_metadata)
>     212                 columns, use_pandas_metadata=use_pandas_metadata)
>     213         return self.reader.read_all(column_indices=column_indices,
> --> 214                                     use_threads=use_threads)
>     215
>     216     def scan_contents(self, columns=None, batch_size=65536):
>
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in
> pyarrow._parquet.ParquetReader.read_all()
>
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
>
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}
[jira] [Updated] (ARROW-5630) [Python] Table of nested arrays doesn't round trip
[ https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5630:
--------------------------------

    Fix Version/s:     (was: 1.0.0)
                   0.15.0

> [Python] Table of nested arrays doesn't round trip
> --------------------------------------------------
>
>                 Key: ARROW-5630
>                 URL: https://issues.apache.org/jira/browse/ARROW-5630
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: pyarrow 0.13, Windows 10
>            Reporter: Philip Felton
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.15.0
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def make_table(num_rows):
>     typ = pa.list_(pa.field("item", pa.float32(), False))
>     return pa.Table.from_arrays([
>         pa.array([[0] * (i % 10) for i in range(0, num_rows)], type=typ),
>         pa.array([[0] * ((i + 5) % 10) for i in range(0, num_rows)], type=typ)
>     ], ['a', 'b'])
>
> pq.write_table(make_table(100), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input> in <module>
> ----> 1 pq.read_table('full.parquet')
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source,
> columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>    1150         return fs.read_parquet(path, columns=columns,
>    1151                                use_threads=use_threads, metadata=metadata,
> -> 1152                                use_pandas_metadata=use_pandas_metadata)
>    1153
>    1154     pf = ParquetFile(source, metadata=metadata)
>
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self,
> path, columns, metadata, schema, use_threads, use_pandas_metadata)
>     179                                  filesystem=self)
>     180         return dataset.read(columns=columns, use_threads=use_threads,
> --> 181                             use_pandas_metadata=use_pandas_metadata)
>     182
>     183     def open(self, path, mode='rb'):
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, use_pandas_metadata)
>    1012             table = piece.read(columns=columns, use_threads=use_threads,
>    1013                                partitions=self.partitions,
> -> 1014                                use_pandas_metadata=use_pandas_metadata)
>    1015             tables.append(table)
>    1016
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, partitions, open_file_func, file, use_pandas_metadata)
>     562             table = reader.read_row_group(self.row_group, **options)
>     563         else:
> --> 564             table = reader.read(**options)
>     565
>     566         if len(self.partition_keys) > 0:
>
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns,
> use_threads, use_pandas_metadata)
>     212                 columns, use_pandas_metadata=use_pandas_metadata)
>     213         return self.reader.read_all(column_indices=column_indices,
> --> 214                                     use_threads=use_threads)
>     215
>     216     def scan_contents(self, columns=None, batch_size=65536):
>
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in
> pyarrow._parquet.ParquetReader.read_all()
>
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
>
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}
[jira] [Updated] (ARROW-5580) [C++][Gandiva] Correct definitions of timestamp functions in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5580: Summary: [C++][Gandiva] Correct definitions of timestamp functions in Gandiva (was: Correct definitions of timestamp functions in Gandiva) > [C++][Gandiva] Correct definitions of timestamp functions in Gandiva > > > Key: ARROW-5580 > URL: https://issues.apache.org/jira/browse/ARROW-5580 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > Timestamp functions are unsupported in Gandiva due to definition mismatch. > For example, Gandiva supports timestampAddMonth(timestamp, int32) but the > expected signature is timestampAddMonth(int32, timestamp). > > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5575) [C++] arrowConfig.cmake includes uninstalled targets
[ https://issues.apache.org/jira/browse/ARROW-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5575: Summary: [C++] arrowConfig.cmake includes uninstalled targets (was: arrowConfig.cmake includes uninstalled targets) > [C++] arrowConfig.cmake includes uninstalled targets > > > Key: ARROW-5575 > URL: https://issues.apache.org/jira/browse/ARROW-5575 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0, 0.14.0, 0.14.1 >Reporter: Matthijs Brobbel >Priority: Minor > > I'm building a CMake project against arrow and I'm using: > {code:java} > find_package(arrow 0.13 CONFIG REQUIRED) > {code} > to get the arrow_shared target in scope. This works for me on macOS. I > installed apache-arrow with: > {code:java} > brew install apache-arrow{code} > However, when I attempt to build the project in an Ubuntu Xenial container, I > get the following CMake error: > {code:java} > CMake Error at /usr/lib/x86_64-linux-gnu/cmake/arrow/arrowTargets.cmake:151 > (message): > The imported target "arrow_cuda_shared" references the file > "/usr/lib/x86_64-linux-gnu/libarrow_cuda.so.13.0.0" > but this file does not exist. Possible reasons include: > * The file was deleted, renamed, or moved to another location. > * An install or uninstall procedure did not complete successfully. > * The installation package was faulty and contained > "/usr/lib/x86_64-linux-gnu/cmake/arrow/arrowTargets.cmake" > but not all the files it references. > Call Stack (most recent call first): > /usr/lib/x86_64-linux-gnu/cmake/arrow/arrowConfig.cmake:61 (include) > CMakeLists.txt:15 (find_package) > {code} > I installed arrow with: > {code:java} > curl -sSL "https://dist.apache.org/repos/dist/dev/arrow/KEYS" | apt-key add - > echo "deb [arch=amd64] https://dl.bintray.com/apache/arrow/ubuntu/ xenial > main" | tee -a /etc/apt/sources.list > apt-get update > apt-get install -y libarrow-dev=0.13.0-1 > {code} > I can also install libarrow-cuda-dev, but I don't want to because I don't > need it. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5562) [C++] parquet writer does not handle negative zero correctly
[ https://issues.apache.org/jira/browse/ARROW-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913773#comment-16913773 ] Wes McKinney commented on ARROW-5562: - Odd issue. I added to 0.15.0 in case someone can take a look > [C++] parquet writer does not handle negative zero correctly > > > Key: ARROW-5562 > URL: https://issues.apache.org/jira/browse/ARROW-5562 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.13.0 >Reporter: Bob Briody >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > > I have the following csv file: (Note that {{col_a}} contains a negative zero > value.) > {code:java} > col_a,col_b > 0.0,0.0 > -0.0,0.0{code} > ...and process it via: > {code:java} > from pyarrow import csv, parquet > in_csv = 'in.csv' > table = csv.read_csv(in_csv) > parquet.write_to_dataset(table, root_path='./'){code} > > The output parquet file is then loaded into S3 and queried via AWS Athena > (i.e. PrestoDB / Hive). > Any query that touches {{col_a}} fails with the following error: > {code:java} > HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split {{REDACTED}} (offset=0, > length=593): low must be less than or equal to high{code} > > As a sanity check, I transformed the csv file to parquet using an AWS Glue > Spark Job and I was able to query the output parquet file successfully. > As such, it appears as though the pyarrow writer is producing an invalid > parquet file when a column contains at least one instance of 0.0, at least > one instance of -0.0, and no other values. > -- This message was sent by Atlassian Jira (v8.3.2#803003)
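One way to narrow this down is to inspect the column statistics pyarrow writes, which is where Athena/Presto trips. A diagnostic sketch (file name illustrative), on the assumption that the invalid min/max pair comes from comparisons that do not order signed zeros:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# A column containing only 0.0 and -0.0, mirroring the report
table = pa.Table.from_arrays([pa.array([0.0, -0.0])], ['col_a'])
pq.write_table(table, 'negzero.parquet')

stats = pq.ParquetFile('negzero.parquet').metadata.row_group(0).column(0).statistics
# If the writer records min = 0.0 and max = -0.0 (0.0 == -0.0 compares equal,
# but their byte representations differ), readers that require min <= max
# will reject the file.
print(stats.min, stats.max)
{code}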
[jira] [Updated] (ARROW-5562) [C++] parquet writer does not handle negative zero correctly
[ https://issues.apache.org/jira/browse/ARROW-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5562: Fix Version/s: 0.15.0 > [C++] parquet writer does not handle negative zero correctly > > > Key: ARROW-5562 > URL: https://issues.apache.org/jira/browse/ARROW-5562 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.13.0 >Reporter: Bob Briody >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > > I have the following csv file: (Note that {{col_a}} contains a negative zero > value.) > {code:java} > col_a,col_b > 0.0,0.0 > -0.0,0.0{code} > ...and process it via: > {code:java} > from pyarrow import csv, parquet > in_csv = 'in.csv' > table = csv.read_csv(in_csv) > parquet.write_to_dataset(table, root_path='./'){code} > > The output parquet file is then loaded into S3 and queried via AWS Athena > (i.e. PrestoDB / Hive). > Any query that touches {{col_a}} fails with the following error: > {code:java} > HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split {{REDACTED}} (offset=0, > length=593): low must be less than or equal to high{code} > > As a sanity check, I transformed the csv file to parquet using an AWS Glue > Spark Job and I was able to query the output parquet file successfully. > As such, it appears as though the pyarrow writer is producing an invalid > parquet file when a column contains at least one instance of 0.0, at least > one instance of -0.0, and no other values. > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5522) [Packaging] Comments out of date in python/manylinux1/build_arrow.sh
[ https://issues.apache.org/jira/browse/ARROW-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5522: Fix Version/s: 0.15.0 > [Packaging] Comments out of date in python/manylinux1/build_arrow.sh > > > Key: ARROW-5522 > URL: https://issues.apache.org/jira/browse/ARROW-5522 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.15.0 > > > The script has this comment: > {code:java} > # Usage: > # docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh > {code} > However, I get: > {code} > Unable to find image 'arrow-base-x86_64:latest' locally > docker: Error response from daemon: pull access denied for arrow-base-x86_64, > repository does not exist or may require 'docker login'. > See 'docker run --help'. > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5558) [C++] Support Array::View on arrays with non-zero offsets
[ https://issues.apache.org/jira/browse/ARROW-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5558: Fix Version/s: 0.15.0 > [C++] Support Array::View on arrays with non-zero offsets > - > > Key: ARROW-5558 > URL: https://issues.apache.org/jira/browse/ARROW-5558 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > Follow up work to initial implementation of {{Array::View}} in ARROW-1774 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5550) [C++] Refactor Buffers method on concatenate to consolidate code.
[ https://issues.apache.org/jira/browse/ARROW-5550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5550: Fix Version/s: 0.15.0 > [C++] Refactor Buffers method on concatenate to consolidate code. > - > > Key: ARROW-5550 > URL: https://issues.apache.org/jira/browse/ARROW-5550 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Micah Kornfield >Priority: Minor > Fix For: 0.15.0 > > > See https://github.com/apache/arrow/pull/4498/files for reference. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5508) [C++] Create reusable Iterator interface
[ https://issues.apache.org/jira/browse/ARROW-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5508: Fix Version/s: 0.15.0 > [C++] Create reusable Iterator interface > > > Key: ARROW-5508 > URL: https://issues.apache.org/jira/browse/ARROW-5508 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > We have various iterator-like classes. I envision a reusable interface like > {code} > template <typename T> > class Iterator { > public: > virtual ~Iterator() = default; > virtual Status Next(T* out) = 0; > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5508) [C++] Create reusable Iterator interface
[ https://issues.apache.org/jira/browse/ARROW-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913772#comment-16913772 ] Wes McKinney commented on ARROW-5508: - [~pitrou] [~bkietz] [~fsaintjacques] is the {{Iterator}} in the codebase now satisfactory? > [C++] Create reusable Iterator interface > > > Key: ARROW-5508 > URL: https://issues.apache.org/jira/browse/ARROW-5508 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > We have various iterator-like classes. I envision a reusable interface like > {code} > template <typename T> > class Iterator { > public: > virtual ~Iterator() = default; > virtual Status Next(T* out) = 0; > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5489) [C++] Normalize kernels and ChunkedArray behavior
[ https://issues.apache.org/jira/browse/ARROW-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5489: Fix Version/s: 1.0.0 > [C++] Normalize kernels and ChunkedArray behavior > - > > Key: ARROW-5489 > URL: https://issues.apache.org/jira/browse/ARROW-5489 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Francois Saint-Jacques >Priority: Major > Fix For: 1.0.0 > > > Some kernels (the wrappers, e.g. Unique) support ChunkedArray inputs, and > some don't. We should normalize this usage. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5494) [Python] Create FileSystem bindings
[ https://issues.apache.org/jira/browse/ARROW-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913771#comment-16913771 ] Wes McKinney commented on ARROW-5494: - I added to 0.15.0 optimistically. Having a minimal binding would be a nice thing, and probably not more than a couple of days' work > [Python] Create FileSystem bindings > --- > > Key: ARROW-5494 > URL: https://issues.apache.org/jira/browse/ARROW-5494 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Labels: filesystem > Fix For: 0.15.0 > > > Now that we have a C++ filesystem API, it should be usable from Python as > well. -- This message was sent by Atlassian Jira (v8.3.2#803003)
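To make the scope concrete, here is a hypothetical sketch of what minimal bindings mirroring the C++ {{arrow::fs}} API could look like from Python. Every module and method name below is illustrative only, not a committed API:

{code:python}
# Hypothetical sketch -- module and method names are placeholders that
# mirror the C++ classes (LocalFileSystem, Selector, GetTargetStats).
from pyarrow import fs

local = fs.LocalFileSystem()
local.create_dir('/tmp/arrow-fs-demo')
with local.open_output_stream('/tmp/arrow-fs-demo/data.bin') as f:
    f.write(b'hello')

selector = fs.Selector('/tmp/arrow-fs-demo', recursive=True)
print(local.get_target_stats(selector))  # stats: path, type, size, mtime
{code}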
[jira] [Commented] (ARROW-5489) [C++] Normalize kernels and ChunkedArray behavior
[ https://issues.apache.org/jira/browse/ARROW-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913769#comment-16913769 ] Wes McKinney commented on ARROW-5489: - I think kernels probably shouldn't support ChunkedArray at all unless there's a good kernel-specific reason. Dispatching kernels on chunked array would best be taken care of at a higher level (e.g. the "Data Frame" interface we've been discussing) > [C++] Normalize kernels and ChunkedArray behavior > - > > Key: ARROW-5489 > URL: https://issues.apache.org/jira/browse/ARROW-5489 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Francois Saint-Jacques >Priority: Major > > Some kernels (the wrappers, e.g. Unique) support ChunkedArray inputs, and > some don't. We should normalize this usage. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5494) [Python] Create FileSystem bindings
[ https://issues.apache.org/jira/browse/ARROW-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5494: Fix Version/s: 0.15.0 > [Python] Create FileSystem bindings > --- > > Key: ARROW-5494 > URL: https://issues.apache.org/jira/browse/ARROW-5494 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Labels: filesystem > Fix For: 0.15.0 > > > Now that we have a C++ filesystem API, it should be usable from Python as > well. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long
[ https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913766#comment-16913766 ] Wes McKinney edited comment on ARROW-5450 at 8/22/19 10:45 PM: --- Added to 0.15.0. I think we should return {{datetime.datetime}} objects except for nanosecond timestamps, which should use Timestamp but only if pandas is available was (Author: wesmckinn): Added to 0.15.0. I think we should return {{datetime.datetime}} objects except for nanosecond timestamps > [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too > large to convert to C long > --- > > Key: ARROW-5450 > URL: https://issues.apache.org/jira/browse/ARROW-5450 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Tim Swast >Priority: Major > Fix For: 0.15.0 > > > When I attempt to roundtrip from a list of moderately large (beyond what can > be represented in nanosecond precision, but within microsecond precision) > datetime objects to pyarrow and back, I get an OverflowError: Python int too > large to convert to C long. > pyarrow version: > {noformat} > $ pip freeze | grep pyarrow > pyarrow==0.13.0{noformat} > > Reproduction: > {code:java} > import datetime > import pandas > import pyarrow > import pytz > timestamp_rows = [ > datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc), > None, > datetime.datetime(, 12, 31, 23, 59, 59, 99, tzinfo=pytz.utc), > datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc), > ] > timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", > tz="UTC")) > timestamp_roundtrip = timestamp_array.to_pylist() > # --- > # OverflowError Traceback (most recent call last) > # in > # > 1 timestamp_roundtrip = timestamp_array.to_pylist() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi > in __iter__() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib.TimestampValue.as_py() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib._datetime_conversion_functions.lambda5() > # > # pandas/_libs/tslibs/timestamps.pyx in > pandas._libs.tslibs.timestamps.Timestamp.__new__() > # > # pandas/_libs/tslibs/conversion.pyx in > pandas._libs.tslibs.conversion.convert_to_tsobject() > # > # OverflowError: Python int too large to convert to C long > {code} > For good measure, I also tested with timezone-naive timestamps with the same > error: > {code:java} > naive_rows = [ > datetime.datetime(1, 1, 1, 0, 0, 0), > None, > datetime.datetime(, 12, 31, 23, 59, 59, 99), > datetime.datetime(1970, 1, 1, 0, 0, 0), > ] > naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None)) > naive_roundtrip = naive_array.to_pylist() > # --- > # OverflowError Traceback (most recent call last) > # in > # > 1 naive_roundtrip = naive_array.to_pylist() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi > in __iter__() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib.TimestampValue.as_py() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib._datetime_conversion_functions.lambda5() > # > # pandas/_libs/tslibs/timestamps.pyx in > pandas._libs.tslibs.timestamps.Timestamp.__new__() > # > # pandas/_libs/tslibs/conversion.pyx in > pandas._libs.tslibs.conversion.convert_to_tsobject() > # > # OverflowError: 
Python int too large to convert to C long > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long
[ https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5450: Fix Version/s: (was: 1.0.0) 0.15.0 > [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too > large to convert to C long > --- > > Key: ARROW-5450 > URL: https://issues.apache.org/jira/browse/ARROW-5450 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Tim Swast >Priority: Major > Fix For: 0.15.0 > > > When I attempt to roundtrip from a list of moderately large (beyond what can > be represented in nanosecond precision, but within microsecond precision) > datetime objects to pyarrow and back, I get an OverflowError: Python int too > large to convert to C long. > pyarrow version: > {noformat} > $ pip freeze | grep pyarrow > pyarrow==0.13.0{noformat} > > Reproduction: > {code:java} > import datetime > import pandas > import pyarrow > import pytz > timestamp_rows = [ > datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc), > None, > datetime.datetime(, 12, 31, 23, 59, 59, 99, tzinfo=pytz.utc), > datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc), > ] > timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", > tz="UTC")) > timestamp_roundtrip = timestamp_array.to_pylist() > # --- > # OverflowError Traceback (most recent call last) > # in > # > 1 timestamp_roundtrip = timestamp_array.to_pylist() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi > in __iter__() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib.TimestampValue.as_py() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib._datetime_conversion_functions.lambda5() > # > # pandas/_libs/tslibs/timestamps.pyx in > pandas._libs.tslibs.timestamps.Timestamp.__new__() > # > # pandas/_libs/tslibs/conversion.pyx in > pandas._libs.tslibs.conversion.convert_to_tsobject() > # > # OverflowError: Python int too large to convert to C long > {code} > For good measure, I also tested with timezone-naive timestamps with the same > error: > {code:java} > naive_rows = [ > datetime.datetime(1, 1, 1, 0, 0, 0), > None, > datetime.datetime(, 12, 31, 23, 59, 59, 99), > datetime.datetime(1970, 1, 1, 0, 0, 0), > ] > naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None)) > naive_roundtrip = naive_array.to_pylist() > # --- > # OverflowError Traceback (most recent call last) > # in > # > 1 naive_roundtrip = naive_array.to_pylist() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi > in __iter__() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib.TimestampValue.as_py() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib._datetime_conversion_functions.lambda5() > # > # pandas/_libs/tslibs/timestamps.pyx in > pandas._libs.tslibs.timestamps.Timestamp.__new__() > # > # pandas/_libs/tslibs/conversion.pyx in > pandas._libs.tslibs.conversion.convert_to_tsobject() > # > # OverflowError: Python int too large to convert to C long > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long
[ https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913766#comment-16913766 ] Wes McKinney commented on ARROW-5450: - Added to 0.15.0. I think we should return {{datetime.datetime}} objects except for nanosecond timestamps > [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too > large to convert to C long > --- > > Key: ARROW-5450 > URL: https://issues.apache.org/jira/browse/ARROW-5450 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Tim Swast >Priority: Major > Fix For: 0.15.0 > > > When I attempt to roundtrip from a list of moderately large (beyond what can > be represented in nanosecond precision, but within microsecond precision) > datetime objects to pyarrow and back, I get an OverflowError: Python int too > large to convert to C long. > pyarrow version: > {noformat} > $ pip freeze | grep pyarrow > pyarrow==0.13.0{noformat} > > Reproduction: > {code:java} > import datetime > import pandas > import pyarrow > import pytz > timestamp_rows = [ > datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc), > None, > datetime.datetime(, 12, 31, 23, 59, 59, 99, tzinfo=pytz.utc), > datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc), > ] > timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", > tz="UTC")) > timestamp_roundtrip = timestamp_array.to_pylist() > # --- > # OverflowError Traceback (most recent call last) > # in > # > 1 timestamp_roundtrip = timestamp_array.to_pylist() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi > in __iter__() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib.TimestampValue.as_py() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib._datetime_conversion_functions.lambda5() > # > # pandas/_libs/tslibs/timestamps.pyx in > pandas._libs.tslibs.timestamps.Timestamp.__new__() > # > # pandas/_libs/tslibs/conversion.pyx in > pandas._libs.tslibs.conversion.convert_to_tsobject() > # > # OverflowError: Python int too large to convert to C long > {code} > For good measure, I also tested with timezone-naive timestamps with the same > error: > {code:java} > naive_rows = [ > datetime.datetime(1, 1, 1, 0, 0, 0), > None, > datetime.datetime(, 12, 31, 23, 59, 59, 99), > datetime.datetime(1970, 1, 1, 0, 0, 0), > ] > naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None)) > naive_roundtrip = naive_array.to_pylist() > # --- > # OverflowError Traceback (most recent call last) > # in > # > 1 naive_roundtrip = naive_array.to_pylist() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi > in __iter__() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib.TimestampValue.as_py() > # > # > ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi > in pyarrow.lib._datetime_conversion_functions.lambda5() > # > # pandas/_libs/tslibs/timestamps.pyx in > pandas._libs.tslibs.timestamps.Timestamp.__new__() > # > # pandas/_libs/tslibs/conversion.pyx in > pandas._libs.tslibs.conversion.convert_to_tsobject() > # > # OverflowError: Python int too large to convert to C long > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
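Some context for the proposal: {{datetime.datetime}} covers years 1-9999 at microsecond precision, while {{pandas.Timestamp}} stores nanoseconds since the epoch in an int64 and therefore only spans roughly 1677-2262. A small illustration of why non-nanosecond units need the plain {{datetime}} path:

{code:python}
import datetime
import pandas as pd

print(datetime.datetime.min, '..', datetime.datetime.max)  # year 1 .. year 9999
print(pd.Timestamp.min, '..', pd.Timestamp.max)            # ~1677-09-21 .. ~2262-04-11

# The values in the report are valid datetimes but unrepresentable as Timestamps:
try:
    pd.Timestamp(datetime.datetime(1, 1, 1))
except (OverflowError, pd.errors.OutOfBoundsDatetime) as exc:
    print(type(exc).__name__, exc)
{code}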
[jira] [Closed] (ARROW-5425) [C++] Non-determinism encountered with cmake-format 0.5.1
[ https://issues.apache.org/jira/browse/ARROW-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-5425. --- Resolution: Not A Problem We're on 0.5.2 now and things seem to be OK > [C++] Non-determinism encountered with cmake-format 0.5.1 > - > > Key: ARROW-5425 > URL: https://issues.apache.org/jira/browse/ARROW-5425 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > > I ran cmake-format 0.5.1 locally and experienced non-deterministic changes > that break the build such as > {code:diff} > diff --git a/cpp/cmake_modules/BuildUtils.cmake > b/cpp/cmake_modules/BuildUtils.cmake > index 45cff6e9c..c80f21f34 100644 > --- a/cpp/cmake_modules/BuildUtils.cmake > +++ b/cpp/cmake_modules/BuildUtils.cmake > @@ -179,7 +179,7 @@ function(ADD_ARROW_LIB LIB_NAME) >else() > # Otherwise, generate a single "objlib" from all C++ modules and link > # that "objlib" into each library kind, to avoid compiling twice > -add_library(${LIB_NAME}_objlib OBJECT ${ARG_SOURCES}) > +add_library(${LIB_NAME}_objlib ${ARG_SOURCES} OBJECT) > # Necessary to make static linking into other shared libraries work > properly > set_property(TARGET ${LIB_NAME}_objlib PROPERTY > POSITION_INDEPENDENT_CODE 1) > if(ARG_DEPENDENCIES) > {code} > I also found that some uses of {{UNKNOWN IMPORTED}} were rewritten as > {{IMPORTED UNKNOWN}}. I have run cmake-format a few more times and haven't > observed this again but wanted to report it in case anyone else encounters it. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5417) [Website] http://arrow.apache.org doesn't redirect to https
[ https://issues.apache.org/jira/browse/ARROW-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913764#comment-16913764 ] Wes McKinney commented on ARROW-5417: - Appears to be an INFRA issue. Might be worth asking infra about what's going on > [Website] http://arrow.apache.org doesn't redirect to https > --- > > Key: ARROW-5417 > URL: https://issues.apache.org/jira/browse/ARROW-5417 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Neal Richardson >Priority: Minor > > This should be a simple (for someone sufficiently authorized) config change > somewhere. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5406) [Plasma] enable Subscribe and GetNotification from Java
[ https://issues.apache.org/jira/browse/ARROW-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5406: Summary: [Plasma] enable Subscribe and GetNotification from Java (was: enable Subscribe and GetNotification from Java) > [Plasma] enable Subscribe and GetNotification from Java > --- > > Key: ARROW-5406 > URL: https://issues.apache.org/jira/browse/ARROW-5406 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Plasma, Java >Reporter: Tim Emerick >Priority: Minor > > Currently, these functions exist in the cpp API, but are not exposed via JNI. > If this is a feature that is in line with the project direction, I would be > happy to implement it. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5375) [C++] Try to move <sstream> out of public headers
[ https://issues.apache.org/jira/browse/ARROW-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913763#comment-16913763 ] Wes McKinney commented on ARROW-5375: - I moved this to 1.0.0. Seems like sstream, iostream, unordered_map, and others might be worth having on a public API blacklist, and we can check this in CI using linter scripts > [C++] Try to move <sstream> out of public headers > - > > Key: ARROW-5375 > URL: https://issues.apache.org/jira/browse/ARROW-5375 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 1.0.0 > > > Followup to ARROW-5102: to try and reduce compile times, try to move > inclusions of {{sstream}} (and other costly headers) out of Arrow public > headers such as {{status.h}}. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5375) [C++] Try to move <sstream> out of public headers
[ https://issues.apache.org/jira/browse/ARROW-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5375: Fix Version/s: 1.0.0 > [C++] Try to move <sstream> out of public headers > - > > Key: ARROW-5375 > URL: https://issues.apache.org/jira/browse/ARROW-5375 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 1.0.0 > > > Followup to ARROW-5102: to try and reduce compile times, try to move > inclusions of {{sstream}} (and other costly headers) out of Arrow public > headers such as {{status.h}}. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5343) [C++] Consider using Buffer for transpose maps in DictionaryType::Unify instead of std::vector
[ https://issues.apache.org/jira/browse/ARROW-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913761#comment-16913761 ] Wes McKinney commented on ARROW-5343: - I'm also interested in incremental unification so I'll have a look at this > [C++] Consider using Buffer for transpose maps in DictionaryType::Unify > instead of std::vector > -- > > Key: ARROW-5343 > URL: https://issues.apache.org/jira/browse/ARROW-5343 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > In the spirit of "track all the allocations", if dictionaries have > non-trivial length, we may want to account for this memory more precisely. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5343) [C++] Consider using Buffer for transpose maps in DictionaryType::Unify instead of std::vector
[ https://issues.apache.org/jira/browse/ARROW-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5343: Fix Version/s: 0.15.0 > [C++] Consider using Buffer for transpose maps in DictionaryType::Unify > instead of std::vector > -- > > Key: ARROW-5343 > URL: https://issues.apache.org/jira/browse/ARROW-5343 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > In the spirit of "track all the allocations", if dictionaries have > non-trivial length, we may want to account for this memory more precisely. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5337) [C++] Add RecordBatch::field method, possibly deprecate "column"
[ https://issues.apache.org/jira/browse/ARROW-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913760#comment-16913760 ] Wes McKinney commented on ARROW-5337: - Raising this issue again for consideration ([~jorisvandenbossche], any thoughts?) given that {{arrow::Column}} is gone now > [C++] Add RecordBatch::field method, possibly deprecate "column" > > > Key: ARROW-5337 > URL: https://issues.apache.org/jira/browse/ARROW-5337 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > As a matter of consistency, it might be better to rename > {{RecordBatch::column}} to {{RecordBatch::field}} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5337) [C++] Add RecordBatch::field method, possibly deprecate "column"
[ https://issues.apache.org/jira/browse/ARROW-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5337: Fix Version/s: 0.15.0 > [C++] Add RecordBatch::field method, possibly deprecate "column" > > > Key: ARROW-5337 > URL: https://issues.apache.org/jira/browse/ARROW-5337 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > As a matter of consistency, it might be better to rename > {{RecordBatch::column}} to {{RecordBatch::field}} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Closed] (ARROW-5312) [C++] Move JSON integration testing utilities to arrow/testing and libarrow_testing.so
[ https://issues.apache.org/jira/browse/ARROW-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-5312. --- Resolution: Won't Fix > [C++] Move JSON integration testing utilities to arrow/testing and > libarrow_testing.so > -- > > Key: ARROW-5312 > URL: https://issues.apache.org/jira/browse/ARROW-5312 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > It's not necessary to have this code in libarrow.so. Let's tackle after > ARROW-3144 and ARROW-835 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5320) [C++] Undefined symbol errors occurred when linking parquet executables
[ https://issues.apache.org/jira/browse/ARROW-5320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913759#comment-16913759 ] Wes McKinney commented on ARROW-5320: - Is there a way to fix this in our build system? > [C++] Undefined symbol errors occurred when linking parquet executables > --- > > Key: ARROW-5320 > URL: https://issues.apache.org/jira/browse/ARROW-5320 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Xcode 10.2 on macOS Mojave 10.14.4 >Reporter: Kenta Murata >Priority: Major > > Undefined symbol errors occurred when linking debug/parquet-reader, > debug/parquet-file-deserialize-test, and debug/parquet-scan. The > unresolved symbol is from boost regex, referenced in libparquet.a. > I tried to build the commit 608e846a9f825a30a0faa651bc0a3eebba20e7db with > Xcode 10.2 on macOS Mojave. > I specified -DARROW_BOOST_VENDORED=ON to avoid the problem related to the > latest boost in Homebrew (See > [https://github.com/boostorg/process/issues/55]). > The complete build log is available here: > [https://gist.github.com/mrkn/e5489140c9a782ca13a1b4bb8dd33111] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5300) [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL
[ https://issues.apache.org/jira/browse/ARROW-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913757#comment-16913757 ] Wes McKinney commented on ARROW-5300: - Added to 0.15.0. I think we ought to just nix this macro > [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL > - > > Key: ARROW-5300 > URL: https://issues.apache.org/jira/browse/ARROW-5300 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0 >Reporter: Weihua Jiang >Priority: Major > Fix For: 0.15.0 > > > I tried to upgrade Apache Arrow to 0.13. But, when building Apache Arrow 0.13 > with option {{-DARROW_NO_DEFAULT_MEMORY_POOL}}, I got a lot of failures. > It seems 0.13 assumes the default memory pool is always available. > > My cmake command is: > {{cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=off > -DARROW_USE_GLOG=off -DARROW_WITH_LZ4=off -DARROW_WITH_ZSTD=off > -DARROW_WITH_SNAPPY=off -DARROW_WITH_BROTLI=off -DARROW_WITH_ZLIB=off > -DARROW_JEMALLOC=off -DARROW_CXXFLAGS=-DARROW_NO_DEFAULT_MEMORY_POOL}} > I tried to fix the compilation by adding some missing constructors. However, > it seems this issue is bigger than I expected. It seems all the builders and > appenders have this issue, as many classes don't even have a memory pool > associated. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5310) [Python] better error message on creating ParquetDataset from empty directory
[ https://issues.apache.org/jira/browse/ARROW-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5310: Labels: dataset parquet (was: parquet) > [Python] better error message on creating ParquetDataset from empty directory > - > > Key: ARROW-5310 > URL: https://issues.apache.org/jira/browse/ARROW-5310 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > Labels: dataset, parquet > > Currently, this is what you get when {{path}} is an existing but empty directory: > {code:python} > >>> dataset = pq.ParquetDataset(path) > --- > IndexError Traceback (most recent call last) > in > > 1 dataset = pq.ParquetDataset(path) > ~/scipy/repos/arrow/python/pyarrow/parquet.py in __init__(self, > path_or_paths, filesystem, schema, metadata, split_row_groups, > validate_schema, filters, metadata_nthreads, memory_map) > 989 > 990 if validate_schema: > --> 991 self.validate_schemas() > 992 > 993 if filters is not None: > ~/scipy/repos/arrow/python/pyarrow/parquet.py in validate_schemas(self) >1025 self.schema = self.common_metadata.schema >1026 else: > -> 1027 self.schema = self.pieces[0].get_metadata().schema >1028 elif self.schema is None: >1029 self.schema = self.metadata.schema > IndexError: list index out of range > {code} > That could be a nicer error message. > Unless we actually want to allow this? (although I am not sure there are good > use cases of empty directories to support this, because from an empty > directory we cannot get any schema or metadata information?) > It is only failing when validating the schemas, so with > {{validate_schema=False}} it actually returns a ParquetDataset object, just > with an empty list for {{pieces}} and no schema. So it would be easy to not > error when validating the schemas as well for this empty-directory case. -- This message was sent by Atlassian Jira (v8.3.2#803003)
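In the meantime, a caller-side guard gives a clearer failure. A sketch, assuming a local filesystem path:

{code:python}
import os
import pyarrow.parquet as pq

def open_dataset(path):
    # Fail loudly instead of hitting IndexError deep inside validate_schemas()
    if os.path.isdir(path) and not os.listdir(path):
        raise ValueError("cannot create ParquetDataset: %r is an empty directory" % path)
    return pq.ParquetDataset(path)
{code}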
[jira] [Commented] (ARROW-5295) [Python] accept pyarrow values / scalars in constructor functions ?
[ https://issues.apache.org/jira/browse/ARROW-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913756#comment-16913756 ] Wes McKinney commented on ARROW-5295: - A "simple" workaround would be to invoke Scalars' {{as_py}} method if they're passed in. That would add perf overhead, though, since we'd need to do {{isinstance}} checks. Another option is to "sanitize" inputs (using a helper function) only in the case of failure on the initial try. So the normal use case won't be affected > [Python] accept pyarrow values / scalars in constructor functions ? > --- > > Key: ARROW-5295 > URL: https://issues.apache.org/jira/browse/ARROW-5295 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > Currently, functions like {{pyarrow.array}} don't accept pyarrow Arrays, nor > scalars of them: > {code} > In [42]: arr = pa.array([1, 2, 3]) > In [43]: pa.array(arr) > ... > ArrowInvalid: Could not convert 1 with type pyarrow.lib.Int64Value: did not > recognize Python value type when inferring an Arrow data type > In [44]: pa.array(list(arr)) > ... > ArrowInvalid: Could not convert 1 with type pyarrow.lib.Int64Value: did not > recognize Python value type when inferring an Arrow data type > {code} > Do we want to allow those / recognize those here? (the first case could even > have a fastpath, as we don't need to do it element by element). > Also scalars are not supported: > {code} > In [46]: type(arr.sum()) > Out[46]: pyarrow.lib.Int64Scalar > In [47]: pa.array([arr.sum()]) > ... > ArrowInvalid: Could not convert 6 with type pyarrow.lib.Int64Scalar: did not > recognize Python value type when inferring an Arrow data type > {code} > And also in other functions we don't accept arrow scalars / values: > {code} > In [48]: string = pa.array(['a'])[0] > In [49]: type(string) > Out[49]: pyarrow.lib.StringValue > In [50]: pa.field(string, pa.int64()) > ... > TypeError: expected bytes, pyarrow.lib.StringValue found > {code} > > -- This message was sent by Atlassian Jira (v8.3.2#803003)
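A rough sketch of the first option; {{_unbox}} is a hypothetical helper that duck-types on the {{as_py}} method that pyarrow values and scalars expose:

{code:python}
import pyarrow as pa

def _unbox(value):
    # Hypothetical helper: turn pyarrow values/scalars back into plain
    # Python objects; pass everything else through unchanged.
    return value.as_py() if hasattr(value, 'as_py') else value

arr = pa.array([1, 2, 3])
pa.array([_unbox(v) for v in arr])   # element-wise unboxing converts cleanly
pa.array([_unbox(arr.sum())])        # scalar -> plain int -> convertible
{code}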
[jira] [Updated] (ARROW-5300) [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL
[ https://issues.apache.org/jira/browse/ARROW-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5300: Fix Version/s: (was: 1.0.0) 0.15.0 > [C++] 0.13 FAILED to build with option -DARROW_NO_DEFAULT_MEMORY_POOL > - > > Key: ARROW-5300 > URL: https://issues.apache.org/jira/browse/ARROW-5300 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0 >Reporter: Weihua Jiang >Priority: Major > Fix For: 0.15.0 > > > I tried to upgrade Apache Arrow to 0.13. But, when building Apache Arrow 0.13 > with option {{-DARROW_NO_DEFAULT_MEMORY_POOL}}, I got a lot of failures. > It seems 0.13 assumes the default memory pool is always available. > > My cmake command is: > {{cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=off > -DARROW_USE_GLOG=off -DARROW_WITH_LZ4=off -DARROW_WITH_ZSTD=off > -DARROW_WITH_SNAPPY=off -DARROW_WITH_BROTLI=off -DARROW_WITH_ZLIB=off > -DARROW_JEMALLOC=off -DARROW_CXXFLAGS=-DARROW_NO_DEFAULT_MEMORY_POOL}} > I tried to fix the compilation by adding some missing constructors. However, > it seems this issue is bigger than I expected. It seems all the builders and > appenders have this issue, as many classes don't even have a memory pool > associated. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5292) [C++] Static libraries are built on AppVeyor
[ https://issues.apache.org/jira/browse/ARROW-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913753#comment-16913753 ] Wes McKinney commented on ARROW-5292: - I'm interested in trying to work around the Parquet unit test issue so we only have to build the shared libraries on Windows as well > [C++] Static libraries are built on AppVeyor > > > Key: ARROW-5292 > URL: https://issues.apache.org/jira/browse/ARROW-5292 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.15.0 > > > Building both static and shared libraries on Windows needs to compile all > source files twice, making CI slower. > Normally, only the shared libraries are needed for testing (except for > Parquet, see PARQUET-1420). -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5215) [C++] Need a way to change integer size in DictBuilder and AdaptiveIntBuilder
[ https://issues.apache.org/jira/browse/ARROW-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913754#comment-16913754 ] Antoine Pitrou commented on ARROW-5215: --- We can probably do that. I suppose it would be templated? > [C++] Need a way to change integer size in DictBuilder and AdaptiveIntBuilder > - > > Key: ARROW-5215 > URL: https://issues.apache.org/jira/browse/ARROW-5215 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > > Currently, there's no way to choose the index type with DictionaryBuilder. > Consequently, it's difficult to implement e.g. casts to a specific dictionary > type (instead the DictionaryBuilder will select the smallest possible integer > size for the index type). > This will allow implementing "null to dict" casting. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5292) [C++] Static libraries are built on AppVeyor
[ https://issues.apache.org/jira/browse/ARROW-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5292: Fix Version/s: 0.15.0 > [C++] Static libraries are built on AppVeyor > > > Key: ARROW-5292 > URL: https://issues.apache.org/jira/browse/ARROW-5292 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.15.0 > > > Building both static and shared libraries on Windows needs to compile all > source files twice, making CI slower. > Normally, only the shared libraries are needed for testing (except for > Parquet, see PARQUET-1420). -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913752#comment-16913752 ] Wes McKinney edited comment on ARROW-5220 at 8/22/19 10:30 PM: --- I'm in theory on board with that idea was (Author: wesmckinn): I'm in theory on board with that idae > [Python] index / unknown columns in specified schema in Table.from_pandas > - > > Key: ARROW-5220 > URL: https://issues.apache.org/jira/browse/ARROW-5220 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > > The {{Table.from_pandas}} method allows to specify a schema ("This can be > used to indicate the type of columns if we cannot infer it automatically."). > But, if you also want to specify the type of the index, you get an error: > {code:python} > df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]}) > df.index = pd.Index(['a', 'b', 'c'], name='index') > my_schema = pa.schema([('index', pa.string()), > ('a', pa.int64()), > ('b', pa.float64()), > ]) > table = pa.Table.from_pandas(df, schema=my_schema) > {code} > gives {{KeyError: 'index'}} (because it tries to look up the "column names" > from the schema in the dataframe, and thus does not find column 'index'). > This also has the consequence that re-using the schema does not work: > {{table1 = pa.Table.from_pandas(df1); table2 = pa.Table.from_pandas(df2, > schema=table1.schema)}} > Extra note: also unknown columns in general give this error (column specified > in the schema that are not in the dataframe). > At least in pyarrow 0.11, this did not give an error (eg noticed this from > the code in example in ARROW-3861). So before, unknown columns in the > specified schema were ignored, while now they raise an error. Was this a > conscious change? > So before also specifying the index in the schema "worked" in the sense that > it didn't raise an error, but it was also ignored, so didn't actually do what > you would expect) > Questions: > - I think that we should support specifying the index in the passed > {{schema}} ? So that the example above works (although this might be > complicated with RangeIndex that is not serialized any more) > - But what to do in general with additional columns in the schema that are > not in the DataFrame? Are we fine with keep raising an error as it is now > (the error message could be improved then)? Or do we again want to ignore > them? (or, it could actually also add them as all nulls to the table) -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913752#comment-16913752 ] Wes McKinney commented on ARROW-5220: - I'm in theory on board with that idea > [Python] index / unknown columns in specified schema in Table.from_pandas > - > > Key: ARROW-5220 > URL: https://issues.apache.org/jira/browse/ARROW-5220 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > > The {{Table.from_pandas}} method allows to specify a schema ("This can be > used to indicate the type of columns if we cannot infer it automatically."). > But, if you also want to specify the type of the index, you get an error: > {code:python} > df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]}) > df.index = pd.Index(['a', 'b', 'c'], name='index') > my_schema = pa.schema([('index', pa.string()), > ('a', pa.int64()), > ('b', pa.float64()), > ]) > table = pa.Table.from_pandas(df, schema=my_schema) > {code} > gives {{KeyError: 'index'}} (because it tries to look up the "column names" > from the schema in the dataframe, and thus does not find column 'index'). > This also has the consequence that re-using the schema does not work: > {{table1 = pa.Table.from_pandas(df1); table2 = pa.Table.from_pandas(df2, > schema=table1.schema)}} > Extra note: also unknown columns in general give this error (column specified > in the schema that are not in the dataframe). > At least in pyarrow 0.11, this did not give an error (eg noticed this from > the code in example in ARROW-3861). So before, unknown columns in the > specified schema were ignored, while now they raise an error. Was this a > conscious change? > So before also specifying the index in the schema "worked" in the sense that > it didn't raise an error, but it was also ignored, so didn't actually do what > you would expect) > Questions: > - I think that we should support specifying the index in the passed > {{schema}} ? So that the example above works (although this might be > complicated with RangeIndex that is not serialized any more) > - But what to do in general with additional columns in the schema that are > not in the DataFrame? Are we fine with keep raising an error as it is now > (the error message could be improved then)? Or do we again want to ignore > them? (or, it could actually also add them as all nulls to the table) -- This message was sent by Atlassian Jira (v8.3.2#803003)
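Until this is decided, a workaround sketch for the example above: lift the index into a real column first, so every field in the schema has a matching dataframe column:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
                       ('a', pa.int64()),
                       ('b', pa.float64())])

# reset_index() makes 'index' an ordinary column; preserve_index=False keeps
# from_pandas from trying to serialize the (now trivial) RangeIndex as well.
table = pa.Table.from_pandas(df.reset_index(), schema=my_schema,
                             preserve_index=False)
{code}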
[jira] [Updated] (ARROW-5239) [JS] Add support for interval types in javascript
[ https://issues.apache.org/jira/browse/ARROW-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5239: Summary: [JS] Add support for interval types in javascript (was: Add support for interval types in javascript) > [JS] Add support for interval types in javascript > - > > Key: ARROW-5239 > URL: https://issues.apache.org/jira/browse/ARROW-5239 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Reporter: Micah Kornfield >Priority: Major > > Update integration_test.py to include interval tests for JSTest once this is > done. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5215) [C++] Need a way to change integer size in DictBuilder and AdaptiveIntBuilder
[ https://issues.apache.org/jira/browse/ARROW-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913751#comment-16913751 ] Wes McKinney commented on ARROW-5215: - Since I added {{Dictionary32Builder}} we are already 1/4 of the way there. We could add index-type-specific builders for the other 3 index types? > [C++] Need a way to change integer size in DictBuilder and AdaptiveIntBuilder > - > > Key: ARROW-5215 > URL: https://issues.apache.org/jira/browse/ARROW-5215 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > > Currently, there's no way to choose the index type with DictionaryBuilder. > Consequently, it's difficult to implement e.g. casts to a specific dictionary > type (instead the DictionaryBuilder will select the smallest possible integer > size for the index type). > This will allow implementing "null to dict" casting. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5211) [Format] Missing documentation under `Dictionary encoding` section on MetaData page
[ https://issues.apache.org/jira/browse/ARROW-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5211: Fix Version/s: 0.15.0 > [Format] Missing documentation under `Dictionary encoding` section on > MetaData page > --- > > Key: ARROW-5211 > URL: https://issues.apache.org/jira/browse/ARROW-5211 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Lennox Stevenson >Priority: Trivial > Labels: documentation > Fix For: 0.15.0 > > > First time throwing up an issue here so let me know if there's anything I > missed / more details I can provide. > Just going through the arrow documentation at > [https://arrow.apache.org/docs/python/] and I noticed that there's a section > that is currently blank. From what I can tell the section > [https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding] > currently contains nothing in it. Is that intended? It was confusing to see a > blank section, but that is just my opinion so it may not be worth changing. > If this is something work fixing / improving, then it's probably worth either > filling out that section or simply removing header to avoid future confusion. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (ARROW-5211) [Format] Missing documentation under `Dictionary encoding` section on MetaData page
[ https://issues.apache.org/jira/browse/ARROW-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-5211: --- Assignee: Wes McKinney > [Format] Missing documentation under `Dictionary encoding` section on > MetaData page > --- > > Key: ARROW-5211 > URL: https://issues.apache.org/jira/browse/ARROW-5211 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Lennox Stevenson >Assignee: Wes McKinney >Priority: Trivial > Labels: documentation > Fix For: 0.15.0 > > > First time throwing up an issue here so let me know if there's anything I > missed / more details I can provide. > Just going through the arrow documentation at > [https://arrow.apache.org/docs/python/] and I noticed that there's a section > that is currently blank. From what I can tell the section > [https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding] > currently contains nothing in it. Is that intended? It was confusing to see a > blank section, but that is just my opinion so it may not be worth changing. > If this is something work fixing / improving, then it's probably worth either > filling out that section or simply removing header to avoid future confusion. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5176) [Python] Automate formatting of python files
[ https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913749#comment-16913749 ] Wes McKinney commented on ARROW-5176: - There hasn't been all that much Python churn lately. Seems like it might be a good "lull" to begin using Black. thoughts? > [Python] Automate formatting of python files > > > Key: ARROW-5176 > URL: https://issues.apache.org/jira/browse/ARROW-5176 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Benjamin Kietzman >Priority: Minor > Fix For: 0.15.0 > > > [Black](https://github.com/ambv/black) is a tool for automatically formatting > python code in ways which flake8 and our other linters approve of. Adding it > to the project will allow more reliably formatted python code and fill a > similar role to {{clang-format}} for c++ and {{cmake-format}} for cmake -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5161) [Python] Cannot convert struct type from Pandas object column
[ https://issues.apache.org/jira/browse/ARROW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913748#comment-16913748 ] Wes McKinney commented on ARROW-5161: - Added to 0.15.0. > [Python] Cannot convert struct type from Pandas object column > - > > Key: ARROW-5161 > URL: https://issues.apache.org/jira/browse/ARROW-5161 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.15.0 > > > As reported on [https://github.com/apache/arrow/issues/4045]. Interestingly, > the datatype is inferred correctly. > {code:python} > >>> df = pd.DataFrame({'col': [{'ints': 5, 'strs': 'foo'}, {'ints': 6, > >>> 'strs': 'bar'}]}) > >>> > >>> df > >>> > >>> > col > 0 {'ints': 5, 'strs': 'foo'} > 1 {'ints': 6, 'strs': 'bar'} > >>> pa.Table.from_pandas(df) > >>> > >>> > Traceback (most recent call last): > File "", line 1, in > pa.Table.from_pandas(df) > File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas > names, arrays, metadata = dataframe_to_arrays( > File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 480, > in dataframe_to_arrays > types) > File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 209, > in construct_metadata > field_name=sanitized_name) > File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 151, > in get_column_metadata > logical_type = get_logical_type(arrow_type) > File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 79, in > get_logical_type > raise NotImplementedError(str(arrow_type)) > NotImplementedError: struct<ints: int64, strs: string> > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
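A workaround sketch that avoids the failing pandas-metadata path; as noted in the report, inference on the column itself already works:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'col': [{'ints': 5, 'strs': 'foo'},
                           {'ints': 6, 'strs': 'bar'}]})

# pa.array infers struct<ints: int64, strs: string> correctly; only the
# pandas-metadata construction inside Table.from_pandas raises.
arr = pa.array(df['col'].tolist())
table = pa.Table.from_arrays([arr], ['col'])
{code}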
[jira] [Updated] (ARROW-5176) [Python] Automate formatting of python files
[ https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5176: Fix Version/s: 0.15.0 > [Python] Automate formatting of python files > > > Key: ARROW-5176 > URL: https://issues.apache.org/jira/browse/ARROW-5176 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Benjamin Kietzman >Priority: Minor > Fix For: 0.15.0 > > > [Black](https://github.com/ambv/black) is a tool for automatically formatting > python code in ways which flake8 and our other linters approve of. Adding it > to the project will allow more reliably formatted python code and fill a > similar role to {{clang-format}} for c++ and {{cmake-format}} for cmake -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5103) [Python] Segfault when using chunked_array.to_pandas on array different types (edge case)
[ https://issues.apache.org/jira/browse/ARROW-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913746#comment-16913746 ] Artem KOZHEVNIKOV commented on ARROW-5103: -- It was fixed in 0.14, wasn't it? > [Python] Segfault when using chunked_array.to_pandas on array different types > (edge case) > --- > > Key: ARROW-5103 > URL: https://issues.apache.org/jira/browse/ARROW-5103 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.12.1, 0.13.0 > Environment: pyarrow 0.12.1 py37hf9e6f3b_0 conda-forge > numpy 1.15.4 py37hacdab7b_0 > MacOs | gcc7 | what else ? >Reporter: Artem KOZHEVNIKOV >Priority: Major > Fix For: 0.15.0 > > > {code:python} > import numpy as np > import pyarrow as pa > ca = pa.chunked_array([pa.array(['rr'] * 10), pa.array(np.arange(10))]) > ca.type > ca.to_pandas() > libc++abi.dylib: terminating with uncaught exception of type > std::length_error: basic_string > Abort trap: 6 > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
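Whichever release fixed it, the invariant behind the crash is that every chunk of a {{ChunkedArray}} must carry the same type. A sketch of a well-formed construction for the same data, converting the integers to strings up front:

{code:python}
import pyarrow as pa

# Both chunks are of type string, so ca.type is well defined and
# to_pandas has a single destination dtype.
ca = pa.chunked_array([
    pa.array(['rr'] * 10),
    pa.array([str(i) for i in range(10)]),
])
print(ca.type)
print(ca.to_pandas().head())
{code}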
[jira] [Commented] (ARROW-5156) [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object has no attribute '_isfilestore'`
[ https://issues.apache.org/jira/browse/ARROW-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913747#comment-16913747 ] Wes McKinney commented on ARROW-5156: - ping [~jorisvandenbossche] [~jreback] > [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with > `'NoneType' object has no attribute '_isfilestore'` > --- > > Key: ARROW-5156 > URL: https://issues.apache.org/jira/browse/ARROW-5156 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.1 > Environment: Mac, Linux >Reporter: Victor Shih >Priority: Major > Labels: parquet > Fix For: 1.0.0 > > > According to > [https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#partitioning-parquet-files], > writing a parquet file to S3 with `partition_cols` should work, but it fails for > me. Example script: > {code:python} > import pandas as pd > import sys > print(sys.version) > print(pd.__version__) > df = pd.DataFrame([{'a': 1, 'b': 2}]) > df.to_parquet('s3://my_s3_bucket/x.parquet', engine='pyarrow') > print('OK 1') > df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], > engine='pyarrow') > print('OK 2') > {code} > Output: > {noformat} > 3.5.2 (default, Feb 14 2019, 01:46:27) > [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] > 0.24.2 > OK 1 > Traceback (most recent call last): > File "./t.py", line 14, in > df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], > engine='pyarrow') > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/core/frame.py", > line 2203, in to_parquet > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", > line 252, in to_parquet > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", > line 118, in write > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", > line 1227, in write_to_dataset > _mkdir_if_not_exists(fs, root_path) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", > line 1182, in _mkdir_if_not_exists > if fs._isfilestore() and not fs.exists(path): > AttributeError: 'NoneType' object has no attribute '_isfilestore' > {noformat} > > Original issue - [https://github.com/apache/arrow/issues/4030] -- This message was sent by Atlassian Jira (v8.3.2#803003)
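Until the default-filesystem resolution is fixed, a possible workaround is to call {{pyarrow.parquet.write_to_dataset}} directly with an explicit filesystem object. A sketch under the assumption that {{s3fs}} is installed and AWS credentials are configured (the bucket path is the one from the report):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs  # assumed available; not part of the original report

df = pd.DataFrame([{'a': 1, 'b': 2}])
table = pa.Table.from_pandas(df)

# Passing the filesystem explicitly avoids the code path in which `fs`
# stays None and `_isfilestore` is looked up on NoneType.
fs = s3fs.S3FileSystem()
pq.write_to_dataset(table, root_path='my_s3_bucket/x2.parquet',
                    partition_cols=['a'], filesystem=fs)
{code}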
[jira] [Updated] (ARROW-5161) [Python] Cannot convert struct type from Pandas object column
[ https://issues.apache.org/jira/browse/ARROW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5161: Fix Version/s: 0.15.0 > [Python] Cannot convert struct type from Pandas object column > - > > Key: ARROW-5161 > URL: https://issues.apache.org/jira/browse/ARROW-5161 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.15.0 > > > As reported on [https://github.com/apache/arrow/issues/4045]. Interestingly, > the datatype is inferred correctly. > {code:python} > >>> df = pd.DataFrame({'col': [{'ints': 5, 'strs': 'foo'}, {'ints': 6, 'strs': 'bar'}]}) > >>> df > col > 0 {'ints': 5, 'strs': 'foo'} > 1 {'ints': 6, 'strs': 'bar'} > >>> pa.Table.from_pandas(df) > Traceback (most recent call last): > File "", line 1, in > pa.Table.from_pandas(df) > File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas > names, arrays, metadata = dataframe_to_arrays( > File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 480, > in dataframe_to_arrays > types) > File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 209, > in construct_metadata > field_name=sanitized_name) > File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 151, > in get_column_metadata > logical_type = get_logical_type(arrow_type) > File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 79, in > get_logical_type > raise NotImplementedError(str(arrow_type)) > NotImplementedError: struct > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5156) [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object has no attribute '_isfilestore'`
[ https://issues.apache.org/jira/browse/ARROW-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5156: Fix Version/s: (was: 1.0.0) 0.15.0 > [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with > `'NoneType' object has no attribute '_isfilestore'` > --- > > Key: ARROW-5156 > URL: https://issues.apache.org/jira/browse/ARROW-5156 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.1 > Environment: Mac, Linux >Reporter: Victor Shih >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > According to > [https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#partitioning-parquet-files], > writing a parquet file to S3 with `partition_cols` should work, but it fails for > me. Example script: > {code:python} > import pandas as pd > import sys > print(sys.version) > print(pd.__version__) > df = pd.DataFrame([{'a': 1, 'b': 2}]) > df.to_parquet('s3://my_s3_bucket/x.parquet', engine='pyarrow') > print('OK 1') > df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], > engine='pyarrow') > print('OK 2') > {code} > Output: > {noformat} > 3.5.2 (default, Feb 14 2019, 01:46:27) > [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] > 0.24.2 > OK 1 > Traceback (most recent call last): > File "./t.py", line 14, in > df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], > engine='pyarrow') > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/core/frame.py", > line 2203, in to_parquet > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", > line 252, in to_parquet > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", > line 118, in write > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", > line 1227, in write_to_dataset > _mkdir_if_not_exists(fs, root_path) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", > line 1182, in _mkdir_if_not_exists > if fs._isfilestore() and not fs.exists(path): > AttributeError: 'NoneType' object has no attribute '_isfilestore' > {noformat} > > Original issue - [https://github.com/apache/arrow/issues/4030] -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5141) [C++] Share more of the IPC testing utils with the rest of Arrow
[ https://issues.apache.org/jira/browse/ARROW-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913745#comment-16913745 ] Wes McKinney commented on ARROW-5141: - Can we move the data examples to {{arrow/testing/example_data.h}} or something? > [C++] Share more of the IPC testing utils with the rest of Arrow > > > Key: ARROW-5141 > URL: https://issues.apache.org/jira/browse/ARROW-5141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 0.15.0 > > > Some APIs in {{arrow/ipc/test-common.h}} aren't really IPC-specific. > Furthermore, {{arrow/ipc/test-common.h}} is already included in non-IPC > tests. Those APIs should be moved to the Arrow-wide testing utilities. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5141) [C++] Share more of the IPC testing utils with the rest of Arrow
[ https://issues.apache.org/jira/browse/ARROW-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5141: Fix Version/s: 0.15.0 > [C++] Share more of the IPC testing utils with the rest of Arrow > > > Key: ARROW-5141 > URL: https://issues.apache.org/jira/browse/ARROW-5141 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 0.15.0 > > > Some APIs in {{arrow/ipc/test-common.h}} aren't really IPC-specific. > Furthermore, {{arrow/ipc/test-common.h}} is already included in non-IPC > tests. Those APIs should be moved to the Arrow-wide testing utilities. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5139) [Python/C++] Empty column selection no longer restores index
[ https://issues.apache.org/jira/browse/ARROW-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5139: Fix Version/s: 0.15.0 > [Python/C++] Empty column selection no longer restores index > > > Key: ARROW-5139 > URL: https://issues.apache.org/jira/browse/ARROW-5139 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.12.1 >Reporter: Florian Jetter >Priority: Minor > Labels: parquet > Fix For: 0.15.0 > > > The index of a dataframe is no longer reconstructed when using empty column > selection. This is a regression relative to 0.12.1 and probably only happens for > pd.RangeIndex. > {code:python} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > print(pa.__version__) > df = pd.DataFrame( > {"a": [1, 2]} > ) > print(df.index) > table = pa.Table.from_pandas(df) > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > reader = pa.BufferReader(buf.getvalue().to_pybytes()) > table_restored = pq.read_pandas(reader, columns=[]) > df_restored = table_restored.to_pandas() > print(len(df_restored)) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5139) [Python/C++] Empty column selection no longer restores index
[ https://issues.apache.org/jira/browse/ARROW-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913743#comment-16913743 ] Wes McKinney commented on ARROW-5139: - Added to 0.15.0 in case something can be done. > [Python/C++] Empty column selection no longer restores index > > > Key: ARROW-5139 > URL: https://issues.apache.org/jira/browse/ARROW-5139 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.12.1 >Reporter: Florian Jetter >Priority: Minor > Labels: parquet > Fix For: 0.15.0 > > > The index of a dataframe is no longer reconstructed when using empty column > selection. This is a regression relative to 0.12.1 and probably only happens for > pd.RangeIndex. > {code:python} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > print(pa.__version__) > df = pd.DataFrame( > {"a": [1, 2]} > ) > print(df.index) > table = pa.Table.from_pandas(df) > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > reader = pa.BufferReader(buf.getvalue().to_pybytes()) > table_restored = pq.read_pandas(reader, columns=[]) > df_restored = table_restored.to_pandas() > print(len(df_restored)) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
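As a stopgap, reading at least one physical column keeps the row count, after which the empty selection can be emulated on the pandas side. A self-contained sketch of that workaround (the column name 'a' matches the repro):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2]})
buf = pa.BufferOutputStream()
pq.write_table(pa.Table.from_pandas(df), buf)

# Read one real column so the restored frame has the right length, then
# drop it to emulate the empty column selection.
reader = pa.BufferReader(buf.getvalue().to_pybytes())
df_restored = pq.read_pandas(reader, columns=['a']).to_pandas()[[]]
print(len(df_restored))  # 2
{code}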
[jira] [Commented] (ARROW-5131) [Python] Add Azure Datalake Filesystem Gen1 Wrapper for pyarrow
[ https://issues.apache.org/jira/browse/ARROW-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913741#comment-16913741 ] Wes McKinney commented on ARROW-5131: - Our medium/long-term plan in Apache Arrow is to support cloud filesystems in C++. See the initial steps in this direction to support Amazon S3: https://github.com/apache/arrow/pull/5167 > [Python] Add Azure Datalake Filesystem Gen1 Wrapper for pyarrow > --- > > Key: ARROW-5131 > URL: https://issues.apache.org/jira/browse/ARROW-5131 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 0.12.1 >Reporter: Gregory Hayes >Priority: Minor > Labels: pull-request-available > Time Spent: 5.5h > Remaining Estimate: 0h > > The current pyarrow package can only read parquet files that have been > written to Gen1 Azure Datalake using the fastparquet engine. This only works > if the dask-adlfs package is explicitly installed and imported. I've added a > method to the dask-adlfs package, found > [here|https://github.com/dask/dask-adlfs], and issued a PR for that change. > To support this capability, I added an ADLFSWrapper to the filesystem.py file. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5125) [Python] Cannot roundtrip extreme dates through pyarrow
[ https://issues.apache.org/jira/browse/ARROW-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5125: Fix Version/s: (was: 1.0.0) 0.15.0 > [Python] Cannot roundtrip extreme dates through pyarrow > --- > > Key: ARROW-5125 > URL: https://issues.apache.org/jira/browse/ARROW-5125 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 > Environment: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 > 2019, 22:22:05) >Reporter: Max Bolingbroke >Assignee: Micah Kornfield >Priority: Major > Labels: parquet, windows > Fix For: 0.15.0 > > > You can roundtrip many dates through a pyarrow array: > > {noformat} > >>> pa.array([datetime.date(1980, 1, 1)], type=pa.date32())[0] > datetime.date(1980, 1, 1){noformat} > > But (on Windows at least), not extreme ones: > > {noformat} > >>> pa.array([datetime.date(1960, 1, 1)], type=pa.date32())[0] > Traceback (most recent call last): > File "", line 1, in > File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__ > File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py > OSError: [Errno 22] Invalid argument > >>> pa.array([datetime.date(3200, 1, 1)], type=pa.date32())[0] > Traceback (most recent call last): > File "", line 1, in > File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__ > File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py > {noformat} > This is because datetime.utcfromtimestamp and datetime.timestamp fail on > these dates, but it seems we should be able to totally avoid invoking these > functions when deserializing dates. Ideally we would be able to roundtrip > these as datetimes too, of course, but it's less clear that this will be > easy. For some context on this see [https://bugs.python.org/issue29097]. > This may be related to ARROW-3176 and ARROW-4746. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5125) [Python] Cannot roundtrip extreme dates through pyarrow
[ https://issues.apache.org/jira/browse/ARROW-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913740#comment-16913740 ] Wes McKinney commented on ARROW-5125: - I added to 0.15.0 in case it can be fixed. The fix is probably not _too_ complex. > [Python] Cannot roundtrip extreme dates through pyarrow > --- > > Key: ARROW-5125 > URL: https://issues.apache.org/jira/browse/ARROW-5125 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 > Environment: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 > 2019, 22:22:05) >Reporter: Max Bolingbroke >Assignee: Micah Kornfield >Priority: Major > Labels: parquet, windows > Fix For: 0.15.0 > > > You can roundtrip many dates through a pyarrow array: > > {noformat} > >>> pa.array([datetime.date(1980, 1, 1)], type=pa.date32())[0] > datetime.date(1980, 1, 1){noformat} > > But (on Windows at least), not extreme ones: > > {noformat} > >>> pa.array([datetime.date(1960, 1, 1)], type=pa.date32())[0] > Traceback (most recent call last): > File "", line 1, in > File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__ > File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py > OSError: [Errno 22] Invalid argument > >>> pa.array([datetime.date(3200, 1, 1)], type=pa.date32())[0] > Traceback (most recent call last): > File "", line 1, in > File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__ > File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py > {noformat} > This is because datetime.utcfromtimestamp and datetime.timestamp fail on > these dates, but it seems we should be able to totally avoid invoking these > functions when deserializing dates. Ideally we would be able to roundtrip > these as datetimes too, of course, but it's less clear that this will be > easy. For some context on this see [https://bugs.python.org/issue29097]. > This may be related to ARROW-3176 and ARROW-4746. -- This message was sent by Atlassian Jira (v8.3.2#803003)
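One possible workaround until the conversion stops going through datetime.utcfromtimestamp: extract the raw day offsets from the {{date32}} array instead of converting each scalar to {{datetime.date}}. A sketch:

{code:python}
import datetime
import pyarrow as pa

arr = pa.array([datetime.date(1960, 1, 1)], type=pa.date32())

# date32 stores days since the UNIX epoch; casting to int32 exposes the
# raw offsets without touching the failing timestamp conversion.
days = arr.cast(pa.int32())
print(days[0])  # -3653: 1960-01-01 is 3653 days before 1970-01-01
{code}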
[jira] [Updated] (ARROW-5121) [C++] arrow::internal::make_unique conflicts std::make_unique on MSVC
[ https://issues.apache.org/jira/browse/ARROW-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5121: Fix Version/s: 0.15.0 > [C++] arrow::internal::make_unique conflicts std::make_unique on MSVC > - > > Key: ARROW-5121 > URL: https://issues.apache.org/jira/browse/ARROW-5121 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Minor > Fix For: 0.15.0 > > > MSVC appears to implement c++20 ADL, which includes function templates with > explicit template arguments (previously these were not looked up through ADL): > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/23604480/job/psvu16jasktacvy2#L2097 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5121) [C++] arrow::internal::make_unique conflicts std::make_unique on MSVC
[ https://issues.apache.org/jira/browse/ARROW-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913739#comment-16913739 ] Wes McKinney commented on ARROW-5121: - cc [~pitrou] > [C++] arrow::internal::make_unique conflicts std::make_unique on MSVC > - > > Key: ARROW-5121 > URL: https://issues.apache.org/jira/browse/ARROW-5121 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Minor > Fix For: 0.15.0 > > > MSVC appears to implement c++20 ADL, which includes function templates with > explicit template arguments (previously these were not looked up through ADL): > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/23604480/job/psvu16jasktacvy2#L2097 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Updated] (ARROW-5106) [Packaging] [C++/Python] Add conda package verification scripts
[ https://issues.apache.org/jira/browse/ARROW-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5106: Fix Version/s: 1.0.0 > [Packaging] [C++/Python] Add conda package verification scripts > --- > > Key: ARROW-5106 > URL: https://issues.apache.org/jira/browse/ARROW-5106 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Priority: Major > Fix For: 1.0.0 > > > Following the conventions of the apt/yum verification scripts: > https://github.com/apache/arrow/pull/4098 -- This message was sent by Atlassian Jira (v8.3.2#803003)