[jira] [Updated] (ARROW-7102) Make filesystem wrappers compatible with fsspec

2019-11-08 Thread Tom Augspurger (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-7102:
--
Issue Type: Improvement  (was: Bug)

> Make filesystem wrappers compatible with fsspec
> ---
>
> Key: ARROW-7102
> URL: https://issues.apache.org/jira/browse/ARROW-7102
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Tom Augspurger
>Priority: Major
>  Labels: FileSystem
>
> [fsspec|https://filesystem-spec.readthedocs.io/en/latest/] defines a 
> common API for a variety of filesystem implementations. I'm proposing an 
> FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec 
> implementation.
>  
> Right now, pyarrow has a pyarrow.filesystems.S3FSWrapper, which is specific 
> to s3fs. 
> [https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
>  This implementation could be removed entirely once an FSSpecWrapper is done, 
> or kept as an alias if it's part of the public API.
>  
> This is related to ARROW-3717, which requested a GCSFSWrapper for working 
> with Google Cloud Storage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7102) Make filesystem wrappers compatible with fsspec

2019-11-08 Thread Tom Augspurger (Jira)
Tom Augspurger created ARROW-7102:
-

 Summary: Make filesystem wrappers compatible with fsspec
 Key: ARROW-7102
 URL: https://issues.apache.org/jira/browse/ARROW-7102
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Tom Augspurger


[fsspec|https://filesystem-spec.readthedocs.io/en/latest/] defines a 
common API for a variety of filesystem implementations. I'm proposing an 
FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec 
implementation.

 

Right now, pyarrow has a pyarrow.filesystems.S3FSWrapper, which is specific to 
s3fs. 
[https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
 This implementation could be removed entirely once an FSSpecWrapper is done, 
or kept as an alias if it's part of the public API.

 

This is related to ARROW-3717, which requested a GCSFSWrapper for working with 
Google Cloud Storage.
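As a rough sketch of what such a wrapper could look like — the class name and method set below are assumptions for illustration, not pyarrow's actual API — any object exposing fsspec-style methods can be adapted by plain delegation. A tiny in-memory stand-in filesystem is used so the example runs without fsspec installed:

```python
import io


class FSSpecWrapper:
    """Adapt any object exposing fsspec-style methods (open/exists/ls).

    Hypothetical sketch only; not pyarrow's real wrapper.
    """

    def __init__(self, fs):
        self.fs = fs

    def open(self, path, mode="rb"):
        return self.fs.open(path, mode)

    def exists(self, path):
        return self.fs.exists(path)

    def ls(self, path):
        return self.fs.ls(path)


class MemoryFS:
    """Minimal in-memory stand-in with fsspec-like method names."""

    def __init__(self):
        self.files = {}

    def open(self, path, mode="rb"):
        if "w" in mode:
            buf = io.BytesIO()
            self.files[path] = buf
            return buf
        return io.BytesIO(self.files[path].getvalue())

    def exists(self, path):
        return path in self.files

    def ls(self, path):
        return sorted(p for p in self.files if p.startswith(path))


fs = FSSpecWrapper(MemoryFS())
fs.open("bucket/data.bin", "wb").write(b"abc")
print(fs.exists("bucket/data.bin"))  # True
```

Because the wrapper only relies on the method names fsspec guarantees, the same delegation would work for s3fs, gcsfs, or any other fsspec implementation.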





[jira] [Updated] (ARROW-7102) Make filesystem wrappers compatible with fsspec

2019-11-08 Thread Tom Augspurger (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-7102:
--
Description: 
[fsspec|https://filesystem-spec.readthedocs.io/en/latest] defines a common API 
for a variety of filesystem implementations. I'm proposing an FSSpecWrapper, 
similar to S3FSWrapper, that works with any fsspec implementation.

 

Right now, pyarrow has a pyarrow.filesystems.S3FSWrapper, which is specific to 
s3fs. 
[https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
 This implementation could be removed entirely once an FSSpecWrapper is done, 
or kept as an alias if it's part of the public API.

 

This is related to ARROW-3717, which requested a GCSFSWrapper for working with 
Google Cloud Storage.

  was:
[fsspec|fsspec: https://filesystem-spec.readthedocs.io/en/latest/] defines a 
common API for a variety filesystem implementations. I'm proposing a 
FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec 
implementation.

 

Right now, pyarrow has a pyarrow.filesystems.S3FSWrapper, which is specific to 
s3fs. 
[https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
 This implementation could be removed entirely once an FSSPecWrapper is done, 
or kept as an alias if it's part of the public API.

 

This is realted to ARROW-3717, which requested a GCSFSWrapper for working with 
google cloud storage.


> Make filesystem wrappers compatible with fsspec
> ---
>
> Key: ARROW-7102
> URL: https://issues.apache.org/jira/browse/ARROW-7102
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Tom Augspurger
>Priority: Major
>  Labels: FileSystem
>
> [fsspec|https://filesystem-spec.readthedocs.io/en/latest] defines a common 
> API for a variety of filesystem implementations. I'm proposing an 
> FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec 
> implementation.
>  
> Right now, pyarrow has a pyarrow.filesystems.S3FSWrapper, which is specific 
> to s3fs. 
> [https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
>  This implementation could be removed entirely once an FSSpecWrapper is done, 
> or kept as an alias if it's part of the public API.
>  
> This is related to ARROW-3717, which requested a GCSFSWrapper for working 
> with Google Cloud Storage.





[jira] [Created] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1897:
-

 Summary: Incorrect numpy_type for pandas metadata of Categoricals
 Key: ARROW-1897
 URL: https://issues.apache.org/jira/browse/ARROW-1897
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Tom Augspurger
 Fix For: 0.9.0


If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: 
json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
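For concreteness, the intended value can be read straight off the categorical's codes array — a minimal sketch, assuming pandas is installed:

```python
import pandas as pd

# A categorical stores integer codes plus a categories array; per the spec,
# "numpy_type" should reflect the codes' dtype rather than 'object'.
idx = pd.CategoricalIndex(['one', 'two'], name='idx')
numpy_type = str(idx.codes.dtype)
print(numpy_type)  # int8
```

With only two categories pandas uses int8 codes, which is exactly the value the metadata should carry.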





[jira] [Updated] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Tom Augspurger (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-1897:
--
Description: 
If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: 
json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.

  was:
If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{{
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: 
json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}

}}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So 'numpy_type' field should be something like `'int8'` instead of `'object'`


> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.9.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code:python}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>...:   index=pd.CategoricalIndex(['one', 'two'], 
> name='idx'))
>...:
> In [8]: sink = io.BytesIO()
>...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>...: 
> json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> > The numpy_type is the physical storage type of the column, which is the 
> > result of str(dtype) for the underlying NumPy array that holds the data. So 
> > for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> > the supported integer categorical types.
> So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.





[jira] [Updated] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Tom Augspurger (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-1897:
--
Description: 
If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: 
json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

bq. The numpy_type is the physical storage type of the column, which is the 
result of str(dtype) for the underlying NumPy array that holds the data. So for 
datetimetz this is datetime64[ns] and for categorical, it may be any of the 
supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.

  was:
If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: 
json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}

}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So 'numpy_type' field should be something like `'int8'` instead of `'object'`


> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.9.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>...:   index=pd.CategoricalIndex(['one', 'two'], 
> name='idx'))
>...:
> In [8]: sink = io.BytesIO()
>...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>...: 
> json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> bq. The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.
> So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.





[jira] [Commented] (ARROW-1580) [Python] Instructions for setting up nightly builds on Linux

2018-01-16 Thread Tom Augspurger (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327817#comment-16327817
 ] 

Tom Augspurger commented on ARROW-1580:
---

The short version is:

 # We have a build machine running Ubuntu.
 # Phil has a build tool [https://github.com/cpcloud/scourge] that's been set 
up on the build machine.
 # That tool is triggered through an Airflow scheduler. Airflow (and some 
PyData ASVs) are bootstrapped through 
[https://github.com/tomaugspurger/asv-runner]. The extra step for the Arrow 
nightlies was to add a file in ~/airflow/dags so that the scheduler picked it 
up.

> [Python] Instructions for setting up nightly builds on Linux
> 
>
> Key: ARROW-1580
> URL: https://issues.apache.org/jira/browse/ARROW-1580
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> cc [~cpcloud]





[jira] [Created] (ARROW-1557) pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1557:
-

 Summary: pyarrow.Table.from_arrays doesn't validate names length
 Key: ARROW-1557
 URL: https://issues.apache.org/jira/browse/ARROW-1557
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor


pa.Table.from_arrays doesn't validate that the lengths of {{arrays}} and 
{{names}} match. I think this should raise a {{ValueError}}:

{code:python}
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 
'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
{code}

(This is my first time using JIRA, hopefully I didn't mess up too badly)
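The missing check itself is a one-liner; a minimal sketch of the idea in plain Python (the `make_table` helper is a hypothetical stand-in, not pyarrow code):

```python
def make_table(arrays, names):
    """Pair columns with names, rejecting mismatched lengths up front."""
    if len(arrays) != len(names):
        raise ValueError(
            "got {} arrays but {} names".format(len(arrays), len(names))
        )
    return dict(zip(names, arrays))


try:
    make_table([[1, 2], [3, 4]], ['a', 'b', 'c'])
except ValueError as exc:
    print(exc)  # got 2 arrays but 3 names
```

Without the guard, `zip` silently drops the extra name, which is exactly the surprising behaviour reported above.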





[jira] [Updated] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Tom Augspurger (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-1557:
--
Summary: [PYTHON] pyarrow.Table.from_arrays doesn't validate names length  
(was: pyarrow.Table.from_arrays doesn't validate names length)

> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the lengths of {{arrays}} and 
> {{names}} match. I think this should raise a {{ValueError}}:
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)





[jira] [Commented] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Tom Augspurger (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172007#comment-16172007
 ] 

Tom Augspurger commented on ARROW-1557:
---

I can probably submit a fix on Thursday or Friday.

> [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
> 
>
> Key: ARROW-1557
> URL: https://issues.apache.org/jira/browse/ARROW-1557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> pa.Table.from_arrays doesn't validate that the lengths of {{arrays}} and 
> {{names}} match. I think this should raise a {{ValueError}}:
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], 
> names=['a', 'b', 'c'])
> Out[2]:
> pyarrow.Table
> a: int64
> b: int64
> In [3]: pa.__version__
> Out[3]: '0.7.0'
> {code}
> (This is my first time using JIRA, hopefully I didn't mess up too badly)





[jira] [Created] (ARROW-1584) [PYTHON] serialize_pandas on empty dataframe

2017-09-20 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1584:
-

 Summary: [PYTHON] serialize_pandas on empty dataframe
 Key: ARROW-1584
 URL: https://issues.apache.org/jira/browse/ARROW-1584
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


This code

{code:python}
import pandas as pd
import pyarrow as pa

pa.serialize_pandas(pd.DataFrame())
{code}

Raises

{code}
---
ArrowNotImplementedError  Traceback (most recent call last)
 in ()
> 1 pa.serialize_pandas(pd.DataFrame())

~/Envs/dask-dev/lib/python3.6/site-packages/pyarrow/ipc.py in 
serialize_pandas(df)
158 sink = pa.BufferOutputStream()
159 writer = pa.RecordBatchStreamWriter(sink, batch.schema)
--> 160 writer.write_batch(batch)
161 writer.close()
162 return sink.get_result()

pyarrow/ipc.pxi in pyarrow.lib._RecordBatchWriter.write_batch 
(/Users/travis/build/apache/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-3.6/lib.cxx:59238)()

pyarrow/error.pxi in pyarrow.lib.check_status 
(/Users/travis/build/apache/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-3.6/lib.cxx:8113)()

ArrowNotImplementedError: Unable to convert type: null

{code}

Presumably {{pa.deserialize_pandas}} will need a fix as well.





[jira] [Created] (ARROW-1585) serialize_pandas round trip fails on integer columns

2017-09-20 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1585:
-

 Summary: serialize_pandas round trip fails on integer columns
 Key: ARROW-1585
 URL: https://issues.apache.org/jira/browse/ARROW-1585
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


This roundtrip fails, since the integer column isn't converted back from a 
string after deserializing:

{code:python}
In [1]: import pandas as pd
In [2]: import pyarrow as pa

In [3]: pa.deserialize_pandas(pa.serialize_pandas(pd.DataFrame({"0": [1, 
2]}))).columns
Out[3]: Index(['0'], dtype='object')
{code}

That should be an {{ Int64Index([0]) }} for the columns.





[jira] [Created] (ARROW-1586) [PYTHON] serialize_pandas roundtrip loses columns name

2017-09-20 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1586:
-

 Summary: [PYTHON] serialize_pandas roundtrip loses columns name
 Key: ARROW-1586
 URL: https://issues.apache.org/jira/browse/ARROW-1586
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


The serialize / deserialize roundtrip loses {{ df.columns.name }}

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], 
name='col_name'))

In [4]: df.columns.name
Out[4]: 'col_name'

In [5]: pa.deserialize_pandas(pa.serialize_pandas(df)).columns.name
{code}

Is this in scope for pyarrow? I suspect it would require an update to the 
pandas section of the Schema metadata.
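Until the schema metadata carries it, an application can side-channel {{df.columns.name}} itself and restore it after the round trip. A sketch assuming pandas is installed; the metadata key below is made up for illustration:

```python
import json

import pandas as pd

df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], name='col_name'))

# Stash the columns name in a JSON blob (hypothetical key, not a real schema
# field).
meta = json.dumps({'columns_name': df.columns.name})

# Stand-in for a serialize/deserialize round trip that drops the name.
restored = pd.DataFrame(df.to_numpy(), columns=list(df.columns))
restored.columns.name = json.loads(meta)['columns_name']
print(restored.columns.name)  # col_name
```

A real fix would put the same information in the 'pandas' schema metadata so every consumer can restore it.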





[jira] [Commented] (ARROW-1585) serialize_pandas round trip fails on integer columns

2017-09-20 Thread Tom Augspurger (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174086#comment-16174086
 ] 

Tom Augspurger commented on ARROW-1585:
---

Sorry, yes, I meant for the original data to be {{ pd.DataFrame({0: [1, 
2]}))).columns }} (an int, not a string).

Agreed that restricting field names to strings is best. Being able to 
reconstruct the original from the metadata is sufficient.

> serialize_pandas round trip fails on integer columns
> 
>
> Key: ARROW-1585
> URL: https://issues.apache.org/jira/browse/ARROW-1585
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
> Fix For: 0.8.0
>
>
> This roundtrip fails, since the Integer column isn't converted to a string 
> after deserializing
> {code:python}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: pa.deserialize_pandas(pa.serialize_pandas(pd.DataFrame({"0": [1, 
> 2]}))).columns
> Out[3]: Index(['0'], dtype='object')
> {code}
> That should be an {{ Int64Index([0]) }} for the columns.





[jira] [Comment Edited] (ARROW-1585) serialize_pandas round trip fails on integer columns

2017-09-20 Thread Tom Augspurger (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174086#comment-16174086
 ] 

Tom Augspurger edited comment on ARROW-1585 at 9/21/17 1:11 AM:


Sorry, yes, I meant for the original data to be {{ pd.DataFrame({0: [1, 2]}) }} 
(an int, not a string).

Agreed that restricting field names to strings is best. Being able to 
reconstruct the original from the metadata is sufficient.


was (Author: tomaugspurger):
Sorry, yes, I meant for the original data to be {{ pd.DataFrame({0: [1, 
2]}))).columns }} (an int, not a string).

Agreed that restricting field names to strings is best. Being able to 
reconstruct the original from the metadata is sufficient.

> serialize_pandas round trip fails on integer columns
> 
>
> Key: ARROW-1585
> URL: https://issues.apache.org/jira/browse/ARROW-1585
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
>Reporter: Tom Augspurger
>Priority: Minor
> Fix For: 0.8.0
>
>
> This roundtrip fails, since the Integer column isn't converted to a string 
> after deserializing
> {code:python}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: pa.deserialize_pandas(pa.serialize_pandas(pd.DataFrame({"0": [1, 
> 2]}))).columns
> Out[3]: Index(['0'], dtype='object')
> {code}
> That should be an {{ Int64Index([0]) }} for the columns.





[jira] [Created] (ARROW-1593) [PYTHON] serialize_pandas should pass through the preserve_index keyword

2017-09-21 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1593:
-

 Summary: [PYTHON] serialize_pandas should pass through the 
preserve_index keyword
 Key: ARROW-1593
 URL: https://issues.apache.org/jira/browse/ARROW-1593
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Assignee: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


I'm doing some benchmarking of Arrow serialization for dask.distributed to 
serialize dataframes.

Overall things look good compared to the current implementation (using pickle). 
The biggest difference was pickle's ability to use pandas' RangeIndex to avoid 
serializing the entire Index of values when possible.

I suspect that a "range type" isn't in scope for Arrow, but in the meantime 
applications using Arrow could detect the `RangeIndex` and call 
{{ pyarrow.serialize_pandas(df, preserve_index=False) }}
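A sketch of that detection on the application side, assuming pandas; the helper name is hypothetical:

```python
import pandas as pd


def should_preserve_index(df):
    """Return False for a trivial RangeIndex, which can be rebuilt from len(df)."""
    idx = df.index
    return not (isinstance(idx, pd.RangeIndex)
                and idx.start == 0 and idx.step == 1)


print(should_preserve_index(pd.DataFrame({'A': [1, 2]})))  # False
print(should_preserve_index(pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])))  # True
```

When the check returns False, the caller can pass preserve_index=False and skip serializing index values that are fully determined by the frame's length.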





[jira] [Commented] (ARROW-2667) [C++/Python] Add pandas-like take method to Array/Column/ChunkedArray

2018-06-02 Thread Tom Augspurger (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499142#comment-16499142
 ] 

Tom Augspurger commented on ARROW-2667:
---

Note that pandas' `take` is a bit complicated by trying to satisfy two APIs 
simultaneously.

 

There's the NumPy-style take from 
[https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html], where 
negative indices count from the end.

 

And then there's the "pandas-style" `take`, where `-1` is an indicator for 
missing values, to be filled with the `na_value` parameter. Other negative 
numbers are not allowed.

 

I'm not sure which is more appropriate for Arrow, but wanted to share a bit of 
background.
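The two semantics can be contrasted in a few lines of plain Python (illustrative helper names, not either library's implementation):

```python
def numpy_style_take(values, indices):
    # NumPy semantics: negative indices count back from the end.
    return [values[i] for i in indices]


def pandas_style_take(values, indices, na_value=None):
    # pandas semantics: -1 marks a missing value, filled with na_value;
    # any other negative index is an error.
    out = []
    for i in indices:
        if i == -1:
            out.append(na_value)
        elif i < 0:
            raise ValueError("only -1 is allowed as a negative index")
        else:
            out.append(values[i])
    return out


print(numpy_style_take([10, 20, 30], [0, -1]))   # [10, 30]
print(pandas_style_take([10, 20, 30], [0, -1]))  # [10, None]
```

The same index list thus produces different results under the two conventions, which is why the choice matters for Arrow's API.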

> [C++/Python] Add pandas-like take method to Array/Column/ChunkedArray
> -
>
> Key: ARROW-2667
> URL: https://issues.apache.org/jira/browse/ARROW-2667
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> We should add a {{take}} method to {{Array/ChunkedArray/Column}} that takes a 
> list of indices and returns a reordered array.
> For reference, see Pandas' interface: 
> https://github.com/pandas-dev/pandas/blob/2cbdd9a2cd19501c98582490e35c5402ae6de941/pandas/core/arrays/base.py#L466





[jira] [Created] (ARROW-8462) Crash in lib.concat_tables on Windows

2020-04-14 Thread Tom Augspurger (Jira)
Tom Augspurger created ARROW-8462:
-

 Summary: Crash in lib.concat_tables on Windows
 Key: ARROW-8462
 URL: https://issues.apache.org/jira/browse/ARROW-8462
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Tom Augspurger


This crashes for me with pyarrow 0.16 on my Windows VM


{code:python}
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])

print('done')
{code}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug 
info on Windows, unfortunately. With `python -X faulthandler` I see

{code}
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in <module>
{code}





[jira] [Updated] (ARROW-8462) Crash in lib.concat_tables on Windows

2020-04-14 Thread Tom Augspurger (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-8462:
--
Description: 
This crashes for me with pyarrow 0.16 on my Windows VM


{code:python}
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])

print('done')
{code}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug 
info on Windows, unfortunately. With `python -X faulthandler` I see

{code}
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in <module>
{code}

  was:
This crashes for me with pyarrow 0.16 on my Windows VM


{{
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])

print('done')
}}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug 
info on windows unfortunately. With `python -X faulthandler` I see

{{
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in (module)
}}


> Crash in lib.concat_tables on Windows
> -
>
> Key: ARROW-8462
> URL: https://issues.apache.org/jira/browse/ARROW-8462
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Tom Augspurger
>Priority: Major
>
> This crashes for me with pyarrow 0.16 on my Windows VM
> {{import pyarrow as pa
> import pandas as pd
> t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
> print("concat")
> pa.lib.concat_tables([t])
> print('done')
> }}
> Installed pyarrow from conda-forge. I'm not really sure how to get more debug 
> info on Windows, unfortunately. With {{python -X faulthandler}} I see
> {{concat
> Windows fatal exception: access violation
> Current thread 0x04f8 (most recent call first):
>   File "bug.py", line 6 in <module>
> }}





[jira] [Commented] (ARROW-8462) Crash in lib.concat_tables on Windows

2020-04-15 Thread Tom Augspurger (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084111#comment-17084111
 ] 

Tom Augspurger commented on ARROW-8462:
---

[~kszucs] I've confirmed that it's fixed with pyarrow 0.16.1.dev552. Thanks!

> Crash in lib.concat_tables on Windows
> -
>
> Key: ARROW-8462
> URL: https://issues.apache.org/jira/browse/ARROW-8462
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Tom Augspurger
>Priority: Major
>
> This crashes for me with pyarrow 0.16 on my Windows VM
> {code:python}
> import pyarrow as pa
> import pandas as pd
> t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
> print("concat")
> pa.lib.concat_tables([t])
> print('done')
> {code}
> Installed pyarrow from conda-forge. I'm not really sure how to get more debug 
> info on Windows, unfortunately. With {{python -X faulthandler}} I see
> {code}
> Windows fatal exception: access violation
> Current thread 0x04f8 (most recent call first):
>   File "bug.py", line 6 in <module>
> {code}





[jira] [Commented] (ARROW-7782) [Python] Losing index information when using write_to_dataset with partition_cols

2020-06-02 Thread Tom Augspurger (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123905#comment-17123905
 ] 

Tom Augspurger commented on ARROW-7782:
---

Joris, was this fix included in 0.17.1? Or is it just for 1.0?

> [Python] Losing index information when using write_to_dataset with 
> partition_cols
> -
>
> Key: ARROW-7782
> URL: https://issues.apache.org/jira/browse/ARROW-7782
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow==0.15.1
>Reporter: Ludwik Bielczynski
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> One cannot save the index when using {{pyarrow.parquet.write_to_dataset()}} 
> with given partition_cols arguments. Here I have created a minimal example 
> which shows the issue:
> {code:python}
>  
> from pathlib import Path
> import pandas as pd
> from pyarrow import Table
> from pyarrow.parquet import write_to_dataset, read_table
> path = Path('/home/user/trials')
> file_name = 'local_database.parquet'
> df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']}, 
>   index=pd.Index(['a', 'b', 'c'], 
>   name='idx'))
> table = Table.from_pandas(df)
> write_to_dataset(table, 
>  str(path / file_name), 
>  partition_cols=['B']
> )
> df_read = read_table(str(path / file_name))
> df_read.to_pandas()
> {code}
>  
> The issue is rather important for pandas and dask users.





[jira] [Created] (ARROW-13410) Implement min_max kernel for array[string]

2021-07-20 Thread Tom Augspurger (Jira)
Tom Augspurger created ARROW-13410:
--

 Summary: Implement min_max kernel for array[string]
 Key: ARROW-13410
 URL: https://issues.apache.org/jira/browse/ARROW-13410
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Affects Versions: 4.0.1
Reporter: Tom Augspurger


As noted in https://github.com/pandas-dev/pandas/issues/42597, 
{{pyarrow.compute.min_max}} on a string-typed array currently raises 
{{ArrowNotImplementedError}}. Here's an example from Python:

{code:python}
In [1]: import pyarrow, pyarrow.compute

In [2]: a = pyarrow.array(['c', 'a', 'b'])

In [4]: pyarrow.compute.min_max(a)
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-4-...> in <module>
----> 1 pyarrow.compute.min_max(a)

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/compute.py in min_max(array, options, memory_pool, **kwargs)

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Function min_max has no kernel matching input types (array[string])
{code}



