[jira] [Updated] (ARROW-5419) [C++] CSV strings_can_be_null option doesn't respect all null_values

2019-05-27 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5419:
-
Labels: csv  (was: )

> [C++] CSV strings_can_be_null option doesn't respect all null_values
> 
>
> Key: ARROW-5419
> URL: https://issues.apache.org/jira/browse/ARROW-5419
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Python 3.6.8
> PyArrow 0.13.1.dev225+g184b8deb
> NumPy 1.16.3
> Pandas 0.24.2
>Reporter: Dennis Waldron
>Priority: Minor
>  Labels: csv
>
> Relates to https://issues.apache.org/jira/browse/ARROW-5195 and 
> [https://github.com/apache/arrow/issues/4184]
> I was testing the new *strings_can_be_null* ConvertOption (built from git 
> 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) in conjunction with the CSV reader 
> and noticed that, when the option is enabled and an empty string is parsed, it 
> doesn't return NULL despite '' being in the default null_values list 
> ([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28])
> {code:java}
> options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
> "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
> "NULL", "NaN", "n/a", "nan", "null"};
> {code}
> Given that the *strings_can_be_null* option was added to expose the same NULL 
> handling for strings as *pandas.read_csv*, I believe it should also treat empty 
> strings as null.
> In Pandas:
> {code:java}
> content = b"a,b\n1,null\n2,\n3,test"
> df = pd.read_csv(io.BytesIO(content))
> print(df)
>    a     b
> 0  1   NaN
> 1  2   NaN
> 2  3  test
> {code}
> In PyArrow:
> {code:java}
> convert_options = pc.ConvertOptions(strings_can_be_null=True)
> table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
> print(table.to_pydict())
> OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5419) [C++] CSV strings_can_be_null option doesn't respect all null_values

2019-05-27 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5419:
-
Description: 
Relates to ARROW-5195 and [https://github.com/apache/arrow/issues/4184]

I was testing the new *strings_can_be_null* ConvertOption (built from git 
184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) in conjunction with the CSV reader 
and noticed that, when the option is enabled and an empty string is parsed, it 
doesn't return NULL despite '' being in the default null_values list 
([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28])
{code:java}
options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
"-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null"};
{code}
Given that the *strings_can_be_null* option was added to expose the same NULL 
handling for strings as *pandas.read_csv*, I believe it should also treat empty 
strings as null.

In Pandas:
{code:java}
content = b"a,b\n1,null\n2,\n3,test"
df = pd.read_csv(io.BytesIO(content))
print(df)
   a     b
0  1   NaN
1  2   NaN
2  3  test
{code}
In PyArrow:
{code:java}
convert_options = pc.ConvertOptions(strings_can_be_null=True)
table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
print(table.to_pydict())
OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
{code}
 

 

  was:
Relates to https://issues.apache.org/jira/browse/ARROW-5195 and 
[https://github.com/apache/arrow/issues/4184]

I was testing the new *strings_can_be_null* ConvertOption (built from git 
184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) in conjunction with the CSV reader 
and noticed that, when the option is enabled and an empty string is parsed, it 
doesn't return NULL despite '' being in the default null_values list 
([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28])
{code:java}
options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
"-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
"NULL", "NaN", "n/a", "nan", "null"};
{code}
Given that the *strings_can_be_null* option was added to expose the same NULL 
handling for strings as *pandas.read_csv*, I believe it should also treat empty 
strings as null.

In Pandas:
{code:java}
content = b"a,b\n1,null\n2,\n3,test"
df = pd.read_csv(io.BytesIO(content))
print(df)
   a     b
0  1   NaN
1  2   NaN
2  3  test
{code}
In PyArrow:
{code:java}
convert_options = pc.ConvertOptions(strings_can_be_null=True)
table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
print(table.to_pydict())
OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
{code}
 

 


> [C++] CSV strings_can_be_null option doesn't respect all null_values
> 
>
> Key: ARROW-5419
> URL: https://issues.apache.org/jira/browse/ARROW-5419
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Python 3.6.8
> PyArrow 0.13.1.dev225+g184b8deb
> NumPy 1.16.3
> Pandas 0.24.2
>Reporter: Dennis Waldron
>Priority: Minor
>  Labels: csv
>
> Relates to ARROW-5195 and [https://github.com/apache/arrow/issues/4184]
> I was testing the new *strings_can_be_null* ConvertOption (built from git 
> 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) in conjunction with the CSV reader 
> and noticed that, when the option is enabled and an empty string is parsed, it 
> doesn't return NULL despite '' being in the default null_values list 
> ([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28])
> {code:java}
> options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
> "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
> "NULL", "NaN", "n/a", "nan", "null"};
> {code}
> Given that the *strings_can_be_null* option was added to expose the same NULL 
> handling for strings as *pandas.read_csv*, I believe it should also treat empty 
> strings as null.
> In Pandas:
> {code:java}
> content = b"a,b\n1,null\n2,\n3,test"
> df = pd.read_csv(io.BytesIO(content))
> print(df)
>    a     b
> 0  1   NaN
> 1  2   NaN
> 2  3  test
> {code}
> In PyArrow:
> {code:java}
> convert_options = pc.ConvertOptions(strings_can_be_null=True)
> table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
> print(table.to_pydict())
> OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5419) [C++] CSV strings_can_be_null option doesn't respect all null_values

2019-05-27 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5419:
-
Fix Version/s: 0.14.0

> [C++] CSV strings_can_be_null option doesn't respect all null_values
> 
>
> Key: ARROW-5419
> URL: https://issues.apache.org/jira/browse/ARROW-5419
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Python 3.6.8
> PyArrow 0.13.1.dev225+g184b8deb
> NumPy 1.16.3
> Pandas 0.24.2
>Reporter: Dennis Waldron
>Priority: Minor
>  Labels: csv
> Fix For: 0.14.0
>
>
> Relates to ARROW-5195 and [https://github.com/apache/arrow/issues/4184]
> I was testing the new *strings_can_be_null* ConvertOption (built from git 
> 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) in conjunction with the CSV reader 
> and noticed that, when the option is enabled and an empty string is parsed, it 
> doesn't return NULL despite '' being in the default null_values list 
> ([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28])
> {code:java}
> options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
> "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
> "NULL", "NaN", "n/a", "nan", "null"};
> {code}
> Given that the *strings_can_be_null* option was added to expose the same NULL 
> handling for strings as *pandas.read_csv*, I believe it should also treat empty 
> strings as null.
> In Pandas:
> {code:java}
> content = b"a,b\n1,null\n2,\n3,test"
> df = pd.read_csv(io.BytesIO(content))
> print(df)
>    a     b
> 0  1   NaN
> 1  2   NaN
> 2  3  test
> {code}
> In PyArrow:
> {code:java}
> convert_options = pc.ConvertOptions(strings_can_be_null=True)
> table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
> print(table.to_pydict())
> OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5419) [C++] CSV strings_can_be_null option doesn't respect all null_values

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848741#comment-16848741
 ] 

Joris Van den Bossche commented on ARROW-5419:
--

As a side note: an "empty" field is the default way for pandas' {{to_csv}} to 
represent missing values, so it would indeed be good to support this when 
reading back in.

Related to this, I was wondering whether we want to differentiate between an 
empty field and an actual quoted empty string, but I suppose this is not that 
easy (as, once parsed, both are identical?). Also pandas' read_csv does not 
seem to differentiate.
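
For illustration (a minimal sketch, not part of the original comment), this is 
the pandas default being referred to: missing values round-trip through an empty 
field.
{code:python}
import io

import pandas as pd

# pandas writes missing values as empty fields by default (na_rep='')
df = pd.DataFrame({"a": [1, 2], "b": ["x", None]})
csv_bytes = df.to_csv(index=False).encode()
print(csv_bytes)  # b'a,b\n1,x\n2,\n'

# read_csv turns the empty field back into NaN, just like a quoted ""
roundtrip = pd.read_csv(io.BytesIO(csv_bytes))
print(roundtrip["b"].isna().tolist())  # [False, True]
{code}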

> [C++] CSV strings_can_be_null option doesn't respect all null_values
> 
>
> Key: ARROW-5419
> URL: https://issues.apache.org/jira/browse/ARROW-5419
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Python 3.6.8
> PyArrow 0.13.1.dev225+g184b8deb
> NumPy 1.16.3
> Pandas 0.24.2
>Reporter: Dennis Waldron
>Priority: Minor
>  Labels: csv
>
> Relates to ARROW-5195 and [https://github.com/apache/arrow/issues/4184]
> I was testing the new *strings_can_be_null* ConvertOption (built from git 
> 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) in conjunction with the CSV reader 
> and noticed that, when the option is enabled and an empty string is parsed, it 
> doesn't return NULL despite '' being in the default null_values list 
> ([https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28])
> {code:java}
> options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN",
> "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA",
> "NULL", "NaN", "n/a", "nan", "null"};
> {code}
> Given that the *strings_can_be_null* option was added to expose the same NULL 
> handling for strings as *pandas.read_csv*, I believe it should also treat empty 
> strings as null.
> In Pandas:
> {code:java}
> content = b"a,b\n1,null\n2,\n3,test"
> df = pd.read_csv(io.BytesIO(content))
> print(df)
>    a     b
> 0  1   NaN
> 1  2   NaN
> 2  3  test
> {code}
> In PyArrow:
> {code:java}
> convert_options = pc.ConvertOptions(strings_can_be_null=True)
> table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
> print(table.to_pydict())
> OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-27 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5349:
-
Attachment: test_pyspark_dataset.zip

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
> Attachments: test_pyspark_dataset.zip
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848753#comment-16848753
 ] 

Joris Van den Bossche commented on ARROW-5349:
--

Summary of the resolution: https://github.com/apache/arrow/pull/4386 makes it 
possible to set the file path on the FileMetadata object returned by the arrow 
parquet writers (through metadata_collector), which means that a user of the 
python API (such as dask) can set the appropriate file path for the collected 
metadata objects before combining them.

For future reference, here is a small partitioned dataset written by PySpark:  
[^test_pyspark_dataset.zip] 
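
For illustration, a minimal sketch of setting the path on a piece's metadata 
(the file names are made up, and this goes through {{read_metadata}} to obtain 
the FileMetadata rather than the metadata_collector route):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# write one piece of a partitioned dataset
table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=["x"])
pq.write_table(table, "dataset/part-0.parquet")

# the FileMetadata of a standalone file has no file_path set; fill it in with
# the path relative to the dataset root before combining the pieces' metadata
metadata = pq.read_metadata("dataset/part-0.parquet")
metadata.set_file_path("part-0.parquet")
{code}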

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
> Attachments: test_pyspark_dataset.zip
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3531) [Python] Deprecate Schema.field_by_name in favor of __getitem__

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848892#comment-16848892
 ] 

Joris Van den Bossche commented on ARROW-3531:
--

We may also want to have a {{field()}} method instead of {{field_by_name}} (in 
addition to {{__getitem__}})?

To be consistent with {{Table.column()}}.
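
For concreteness, a sketch of the spellings being discussed (the {{field()}} 
method is only a suggestion here and does not exist yet):
{code:python}
import pyarrow as pa

schema = pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.string())])

schema.field_by_name("a")   # current spelling, candidate for deprecation
schema[0]                   # __getitem__ (by position)
# schema.field("a")         # suggested spelling, mirroring Table.column()
{code}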

> [Python] Deprecate Schema.field_by_name in favor of __getitem__ 
> 
>
> Key: ARROW-3531
> URL: https://issues.apache.org/jira/browse/ARROW-3531
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> Similarly like https://github.com/apache/arrow/pull/2754



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5169) [Python] non-nullable fields are converted to nullable in {{Table.from_pandas}}

2019-05-27 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-5169:


Assignee: Joris Van den Bossche

> [Python] non-nullable fields are converted to nullable in 
> {{Table.from_pandas}}
> ---
>
> Key: ARROW-5169
> URL: https://issues.apache.org/jira/browse/ARROW-5169
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: giacomo
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In version 0.13.0, the {{Table.from_pandas}} function modifies the input 
> schema by making all non-nullable types nullable.
> This can cause problems for example with this code:
> {code}
> import io
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame(list(range(200)), columns=['numcol'])
> schema = pa.schema([
>     pa.field('numcol', pa.int64(), nullable=False),
> ])
> writer = pq.ParquetWriter(io.BytesIO(), schema, version='2.0')
> table = pa.Table.from_pandas(df, schema=schema)
> writer.write_table(table)
> {code}
> This fails because the writer schema and the table schema are different.
> I believe the direct cause could be 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L622] 
> where nullable is set to True by default, resulting in the table schema being 
> modified.
>  
> Thanks for your valuable work on this library.
> Giacomo



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849118#comment-16849118
 ] 

Joris Van den Bossche commented on ARROW-1983:
--

{quote}Correspondingly, please also write a function that parses a 
multiple-metadata file like{quote}

We already have {{read_metadata}} (on the python side, it actually uses the 
normal parquet file reading), which does read such {{_metadata}} files. Are you 
still looking for something else?

To my understanding, it is not really a "multiple-metadata file", but a file 
with a single FileMetadata where the row groups of all metadata objects are 
combined.
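
For reference, a minimal sketch of reading such a file (the path is made up):
{code:python}
import pyarrow.parquet as pq

# _metadata is a regular parquet footer without data pages, so the normal
# metadata reader handles it; the row groups of all pieces appear combined
# in the single FileMetadata object
meta = pq.read_metadata("dataset/_metadata")
print(meta.num_row_groups)
{code}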

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5427) [Python] RangeIndex serialization change implications

2019-05-27 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5427:


 Summary: [Python] RangeIndex serialization change implications
 Key: ARROW-5427
 URL: https://issues.apache.org/jira/browse/ARROW-5427
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


In 0.13, the conversion of a pandas DataFrame's RangeIndex changed: it is no 
longer serialized as an actual column in the arrow table, but only saved as 
metadata (in the pandas metadata) (ARROW-1639).

This change led to a couple of issues:

- It can sometimes be unpredictable in pandas when you have a RangeIndex and 
when not, which means that the resulting schema in arrow can be somewhat 
unexpected. See ARROW-5104: an empty DataFrame has a RangeIndex or not depending 
on how it was created.
- The metadata is not always enough (or not updated) to reconstruct it when the 
table has been modified / subsetted.  
  For example, ARROW-5138: retrieving a single row group from parquet file 
doesn't restore index properly (since the RangeIndex metadata was for the full 
table, not this subset)
  And another one, ARROW-5139: empty column selection no longer restores index.

I think we should decide whether we want to try to fix those (or give an 
option to avoid those issues), or close them as "won't fix".

One idea I had that could potentially alleviate some of those issues:

- Make it possible for the user to still force actual serialization of the 
index, always, even if it is a RangeIndex.
- To not introduce a new option, we could reuse the {{preserve_index}} keyword: 
change the default to None (which means the current behaviour), and change 
{{True}} to mean "always serialize" (although this is not fully backwards 
compatible with 0.13.0 for those users who explicitly specified the keyword).

I am not sure this is worth the added complexity (although I personally like 
providing the option where the index is simply always serialized as columns, 
without surprises). But ideally we decide on it for 0.14, to either fix or 
close the mentioned issues.
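
For reference, a small sketch of the current (0.13) behaviour described above, 
where a trivial RangeIndex only ends up in the pandas metadata:
{code:python}
import json

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})   # df.index is a trivial RangeIndex

table = pa.Table.from_pandas(df)
print(table.schema.names)             # ['a'] -> no index column was serialized

pandas_meta = json.loads(table.schema.metadata[b"pandas"].decode())
print(pandas_meta["index_columns"])   # the RangeIndex is kept only as metadata
{code}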



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849187#comment-16849187
 ] 

Joris Van den Bossche commented on ARROW-1983:
--

I think so yes (at least when reading, it returns a single FileMetadata 
instance with all row groups).

Besides the "append" operation, we also need a "write" method for such 
FileMetadata instance (I suppose this only needs some work on the python/cython 
side, since this is just writing a parquet file without actual data, although 
didn't check C++). There is currently a {{write_metadata}}, but that requires 
an *arrow* schema, and not a *parquet* schema. 
Regarding the public API, I suppose we can modify {{write_metadata}} to also 
accept a parquet schema, to not have to add an extra function. That will need 
some changes under the hood in {{ParquetWriter}} to be able to accept a given 
FileMetadata object.
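
For reference, a minimal sketch of the existing helper (the missing piece, 
writing from a FileMetadata object, is only indicated as a comment):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# write_metadata today takes an *arrow* schema and writes a data-less parquet file
schema = pa.schema([pa.field("a", pa.int64())])
pq.write_metadata(schema, "_common_metadata")

# still missing: the same starting from a (combined) *parquet* FileMetadata object,
# e.g. something like pq.write_metadata(file_metadata, "_metadata")  (hypothetical)
{code}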

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849187#comment-16849187
 ] 

Joris Van den Bossche edited comment on ARROW-1983 at 5/27/19 8:29 PM:
---

I think so yes (at least when reading, it returns a single FileMetadata 
instance with all row groups).

Besides the "append" operation, we also need a "write" method for such 
FileMetadata instance (I suppose this only needs some work on the python/cython 
side, since this is just writing a parquet file without actual data, although 
didn't check C++). There is currently a {{write_metadata}}, but that requires 
an _arrow_ schema, and not a _parquet_ schema. 
Regarding the public API, I suppose we can modify {{write_metadata}} to also 
accept a parquet schema, to not have to add an extra function. That will need 
some changes under the hood in {{ParquetWriter}} to be able to accept a given 
FileMetadata object.


was (Author: jorisvandenbossche):
I think so yes (at least when reading, it returns a single FileMetadata 
instance with all row groups).

Besides the "append" operation, we also need a "write" method for such 
FileMetadata instance (I suppose this only needs some work on the python/cython 
side, since this is just writing a parquet file without actual data, although 
didn't check C++). There is currently a {{write_metadata}}, but that requires 
an *arrow* schema, and not a *parquet* schema. 
Regarding the public API, I suppose we can modify {{write_metadata}} to also 
accept a parquet schema, to not have to add an extra function. That will need 
some changes under the hood in {{ParquetWriter}} to be able to accept a given 
FileMetadata object.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2019-05-27 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849187#comment-16849187
 ] 

Joris Van den Bossche edited comment on ARROW-1983 at 5/27/19 8:33 PM:
---

I think so yes (at least when reading, it returns a single FileMetadata 
instance with all row groups).

Besides the "append" operation, we also need a "write" method for such 
FileMetadata instance (I suppose this only needs some work on the python/cython 
side, since this is just writing a parquet file without actual data, although 
didn't check C++). There is currently a {{write_metadata}}, but that requires 
an _arrow_ schema, and not a _parquet_ schema. 
Regarding the public API, I suppose we can modify {{write_metadata}} to also 
accept a parquet schema, to not have to add an extra function. But it will need 
some more changes under the hood in {{ParquetWriter}} to be able to accept a 
given FileMetadata object instead of creating one based on the data it is 
writing.


was (Author: jorisvandenbossche):
I think so yes (at least when reading, it returns a single FileMetadata 
instance with all row groups).

Besides the "append" operation, we also need a "write" method for such 
FileMetadata instance (I suppose this only needs some work on the python/cython 
side, since this is just writing a parquet file without actual data, although 
didn't check C++). There is currently a {{write_metadata}}, but that requires 
an _arrow_ schema, and not a _parquet_ schema. 
Regarding the public API, I suppose we can modify {{write_metadata}} to also 
accept a parquet schema, to not have to add an extra function. That will need 
some changes under the hood in {{ParquetWriter}} to be able to accept a given 
FileMetadata object.

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to new function that then passes them on as C++ objects to {{parquet-cpp}} 
> that generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5430) [Python] Can read but not write parquet partitioned on large ints

2019-05-28 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5430:
-
Labels: parquet  (was: )

> [Python] Can read but not write parquet partitioned on large ints
> -
>
> Key: ARROW-5430
> URL: https://issues.apache.org/jira/browse/ARROW-5430
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>Reporter: Robin Kåveland
>Priority: Minor
>  Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 129, in read
>     **kwargs).to_pandas()
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py",
>  line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to 
> C long{code}
> I set the priority to minor here because it's easy enough to work around this 
> in user code unless you really need the 64 bit hash (and you probably 
> shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5430) [Python] Can read but not write parquet partitioned on large ints

2019-05-28 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849483#comment-16849483
 ] 

Joris Van den Bossche commented on ARROW-5430:
--

Thanks for the report! The error is actually not really related to the parquet 
code, but due to pyarrow trying to convert the large integers into a pyarrow 
Array. 
So a smaller example that reproduces the issue:

{code:python}
In [21]: pa.array([14989096668145380166, 15869664087396458664])
...
ArrowException: Unknown error: Python int too large to convert to C long

In [22]: pa.array([14989096668145380166, 15869664087396458664], type=pa.uint64())
Out[22]:
[
  -3457647405564171450,
  -2577079986313092952
]
{code}

So when specifying the type, pyarrow can correctly convert it, but there is 
apparently not yet an automatic inference for uint64.

I think a patch that tries uint64 in case the integers are too big is certainly 
welcome!

> [Python] Can read but not write parquet partitioned on large ints
> -
>
> Key: ARROW-5430
> URL: https://issues.apache.org/jira/browse/ARROW-5430
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>Reporter: Robin Kåveland
>Priority: Minor
>  Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 129, in read
>     **kwargs).to_pandas()
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py",
>  line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to 
> C long{code}
> I set the priority to minor here because it's easy enough to work around this 
> in user code unless you really need the 64 bit hash (and you probably 
> shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5430) [Python] Can read but not write parquet partitioned on large ints

2019-05-28 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849485#comment-16849485
 ] 

Joris Van den Bossche commented on ARROW-5430:
--

Actually, I see we had ARROW-2972 about this, where we decided to not do this 
automatic inference for unsigned integers (cc [~pitrou])

> [Python] Can read but not write parquet partitioned on large ints
> -
>
> Key: ARROW-5430
> URL: https://issues.apache.org/jira/browse/ARROW-5430
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>Reporter: Robin Kåveland
>Priority: Minor
>  Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 129, in read
>     **kwargs).to_pandas()
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py",
>  line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to 
> C long{code}
> I set the priority to minor here because it's easy enough to work around this 
> in user code unless you really need the 64 bit hash (and you probably 
> shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5430) [Python] Can read but not write parquet partitioned on large ints

2019-05-28 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849491#comment-16849491
 ] 

Joris Van den Bossche commented on ARROW-5430:
--

I agree that ideally we should either fix it (in pa.array(), i.e. reconsider the 
"won't fix" of ARROW-2972, or add an extra try/except in the parquet code 
specifically for this), or disallow it. 

> [Python] Can read but not write parquet partitioned on large ints
> -
>
> Key: ARROW-5430
> URL: https://issues.apache.org/jira/browse/ARROW-5430
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>Reporter: Robin Kåveland
>Priority: Minor
>  Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 129, in read
>     **kwargs).to_pandas()
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py",
>  line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to 
> C long{code}
> I set the priority to minor here because it's easy enough to work around this 
> in user code unless you really need the 64 bit hash (and you probably 
> shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5430) [Python] Can read but not write parquet partitioned on large ints

2019-05-28 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849499#comment-16849499
 ] 

Joris Van den Bossche commented on ARROW-5430:
--

Not fully, see my first comment with a smaller example (without any parquet), 
giving the same error. 

In the parquet code, we do

{code}
integer_keys = [int(x) for x in self.keys]
dictionary = lib.array(integer_keys)
{code}

where {{integer_keys}} will actually be a list of python ints, on which we do 
automatic type inference.

> [Python] Can read but not write parquet partitioned on large ints
> -
>
> Key: ARROW-5430
> URL: https://issues.apache.org/jira/browse/ARROW-5430
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>Reporter: Robin Kåveland
>Priority: Minor
>  Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 129, in read
>     **kwargs).to_pandas()
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py",
>  line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to 
> C long{code}
> I set the priority to minor here because it's easy enough to work around this 
> in user code unless you really need the 64 bit hash (and you probably 
> shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5430) [Python] Can read but not write parquet partitioned on large ints

2019-05-28 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849539#comment-16849539
 ] 

Joris Van den Bossche commented on ARROW-5430:
--

The keys come from the directory names, so are essentially strings. But we try 
to recover the original data type somewhat (in case of ints, the snippet above), 
although this is of course not very robust in general.

I think we should at least fall back to leaving them as strings if the conversion 
fails (that's an easy fix in the code, we only need to expand the exceptions in 
the except clause around the snippet above). 
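
A hypothetical sketch of such a fallback (the helper name and the exact exception 
types are assumptions, not the actual patch):
{code:python}
import pyarrow as pa

def partition_keys_to_dictionary(keys):
    """Convert partition directory names to a dictionary array, if possible."""
    try:
        # try to recover integer partition values from the directory names
        return pa.array([int(x) for x in keys])
    except (ValueError, OverflowError, pa.lib.ArrowException):
        # conversion failed (not integers, or too large for int64):
        # fall back to keeping the keys as plain strings
        return pa.array(keys)
{code}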

> [Python] Can read but not write parquet partitioned on large ints
> -
>
> Key: ARROW-5430
> URL: https://issues.apache.org/jira/browse/ARROW-5430
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>Reporter: Robin Kåveland
>Priority: Minor
>  Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 129, in read
>     **kwargs).to_pandas()
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py",
>  line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to 
> C long{code}
> I set the priority to minor here because it's easy enough to work around this 
> in user code unless you really need the 64 bit hash (and you probably 
> shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5430) [Python] Can read but not write parquet partitioned on large ints

2019-05-28 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850158#comment-16850158
 ] 

Joris Van den Bossche commented on ARROW-5430:
--

Robin: yes, a fix for the error type + message would be welcome.
Doing that will automatically also fix the fallback to strings in the parquet 
code (your point 3), as the snippet I showed above that failed on the array() 
call is already in a try/except block. The error was only not caught because it 
was the wrong error type.


> [Python] Can read but not write parquet partitioned on large ints
> -
>
> Key: ARROW-5430
> URL: https://issues.apache.org/jira/browse/ARROW-5430
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>Reporter: Robin Kåveland
>Priority: Minor
>  Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 129, in read
>     **kwargs).to_pandas()
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py",
>  line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to 
> C long{code}
> I set the priority to minor here because it's easy enough to work around this 
> in user code unless you really need the 64 bit hash (and you probably 
> shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5436) [Python] expose filters argument in parquet.read_table

2019-05-29 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5436:


 Summary: [Python] expose filters argument in parquet.read_table
 Key: ARROW-5436
 URL: https://issues.apache.org/jira/browse/ARROW-5436
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


Currently, the {{parquet.read_table}} function can be used both for reading a 
single file (interface to ParquetFile) as a directory (interface to 
ParquetDataset). 

ParquetDataset has some extra keywords such as {{filters}} that would be nice 
to expose through {{read_table}} as well.

Of course one can always use {{ParquetDataset}} directly when its full 
functionality is needed, but for pandas wrapping pyarrow it is easier to pass 
keywords straight through to {{parquet.read_table}} than to choose between 
calling {{read_table}} or {{ParquetDataset}}. Context: 
https://github.com/pandas-dev/pandas/issues/26551
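
For illustration, a sketch of the difference (the dataset path and partition 
column are made up; the read_table call is the proposed, not yet existing, form):
{code:python}
import pyarrow.parquet as pq

# today: filtering on partition keys requires going through ParquetDataset
dataset = pq.ParquetDataset("dataset_root/", filters=[("year", "=", 2019)])
table = dataset.read()

# proposed: let read_table forward such keywords to ParquetDataset, e.g.
# table = pq.read_table("dataset_root/", filters=[("year", "=", 2019)])
{code}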



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5514) [C++] Printer for uint64 shows wrong values

2019-06-05 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5514:


 Summary: [C++] Printer for uint64 shows wrong values
 Key: ARROW-5514
 URL: https://issues.apache.org/jira/browse/ARROW-5514
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche


From the example in ARROW-5430:

{code}
In [16]: pa.array([14989096668145380166, 15869664087396458664], 
type=pa.uint64())   

Out[16]: 

[
  -3457647405564171450,
  -2577079986313092952
]
{code}

I _think_ the actual conversion is correct, and it's only the printer that is 
going wrong, as {{to_numpy}} gives the correct values:

{code}
In [17]: pa.array([14989096668145380166, 15869664087396458664], 
type=pa.uint64()).to_numpy()

Out[17]: array([14989096668145380166, 15869664087396458664], dtype=uint64)
{code}
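
The printed values are consistent with the unsigned values being reinterpreted 
as signed 64-bit integers, which can be checked directly (a small sketch to 
illustrate, not the actual printer code):

{code:python}
# Reinterpreting the two uint64 values as int64 (two's complement) reproduces
# exactly the numbers shown by the printer above.
for v in [14989096668145380166, 15869664087396458664]:
    print(v - 2**64)
# -3457647405564171450
# -2577079986313092952
{code}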



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5138) [Python/C++] Row group retrieval doesn't restore index properly

2019-06-05 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856508#comment-16856508
 ] 

Joris Van den Bossche commented on ARROW-5138:
--

[~wesmckinn] I don't think that will solve this problem. The _original_ 
dataframe (when converted to an arrow Table) had a trivial RangeIndex (starting 
at 0, step of 1), so the optimization would have been correctly applied 
according to that logic. 

It is only when a Table is sliced or split (into row groups, and then reading 
a single row group instead of the full table) that the RangeIndex metadata gets 
"out of date" and no longer matches the new (subsetted) arrow Table.

See also ARROW-5427 for a summary issue I made on this topic.

> [Python/C++] Row group retrieval doesn't restore index properly
> ---
>
> Key: ARROW-5138
> URL: https://issues.apache.org/jira/browse/ARROW-5138
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Florian Jetter
>Priority: Minor
>  Labels: parquet
> Fix For: 0.14.0
>
>
> When retrieving row groups, the index is no longer properly restored to its 
> initial value and is set to a range index starting at zero no matter what. 
> Version 0.12.1 restored an Int64 index with the correct index values.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(pa.__version__)
> df = pd.DataFrame(
> {"a": [1, 2, 3, 4]}
> )
> print("total DF")
> print(df.index)
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf, chunk_size=2)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> parquet_file = pq.ParquetFile(reader)
> rg = parquet_file.read_row_group(1)
> df_restored = rg.to_pandas()
> print("Row group")
> print(df_restored.index)
> {code}
> Previous behavior
> {code:python}
> 0.12.1
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> Int64Index([2, 3], dtype='int64')
> {code}
> Behavior now
> {code:python}
> 0.13.0
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> RangeIndex(start=0, stop=2, step=1)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2667) [C++/Python] Add pandas-like take method to Array

2019-06-05 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856516#comment-16856516
 ] 

Joris Van den Bossche commented on ARROW-2667:
--

[~wesmckinn] you renamed this issue to only be about Array (and opened 
ARROW-5454 for the ChunkedArray part). So then this can be closed? (the python 
Array part was tackled in ARROW-5291)

> [C++/Python] Add pandas-like take method to Array
> -
>
> Key: ARROW-2667
> URL: https://issues.apache.org/jira/browse/ARROW-2667
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> We should add a {{take}} method to {{Array/ChunkedArray/Column}} that takes a 
> list of indices and returns a reordered array.
> For reference, see Pandas' interface: 
> https://github.com/pandas-dev/pandas/blob/2cbdd9a2cd19501c98582490e35c5402ae6de941/pandas/core/arrays/base.py#L466
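
For illustration, hypothetical usage of such a method (a sketch; the exact name 
and signature are assumptions based on the proposal above):

{code:python}
import pyarrow as pa

arr = pa.array(['a', 'b', 'c', 'd'])
indices = pa.array([3, 1, 1, 0])

# Proposed behaviour: return a new array with the values reordered
# according to the given indices.
print(arr.take(indices))   # expected: ["d", "b", "b", "a"]
{code}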



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2667) [C++/Python] Add pandas-like take method to Array

2019-06-05 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856516#comment-16856516
 ] 

Joris Van den Bossche edited comment on ARROW-2667 at 6/5/19 8:55 AM:
--

[~wesmckinn] you renamed this issue to only be about Array (and opened 
ARROW-5454 for the ChunkedArray part). So then this can be closed? (the python 
Array part was tackled in ARROW-5291)

Edit: the other issue is only about C++, so we can keep this open for the 
Python side of course.


was (Author: jorisvandenbossche):
[~wesmckinn] you renamed this issue to only be about Array (and opened 
ARROW-5454 for the ChunkedArray part). So then this can be closed? (the python 
Array part was tackled in ARROW-5291)

> [C++/Python] Add pandas-like take method to Array
> -
>
> Key: ARROW-2667
> URL: https://issues.apache.org/jira/browse/ARROW-2667
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> We should add a {{take}} method to {{Array/ChunkedArray/Column}} that takes a 
> list of indices and returns a reordered array.
> For reference, see Pandas' interface: 
> https://github.com/pandas-dev/pandas/blob/2cbdd9a2cd19501c98582490e35c5402ae6de941/pandas/core/arrays/base.py#L466



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long

2019-06-05 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856527#comment-16856527
 ] 

Joris Van den Bossche commented on ARROW-5450:
--

Thanks for the report!

The problem here is that pyarrow converts to pandas Timestamp objects if 
pandas is installed (and otherwise to datetime.datetime objects). And pandas 
has the limitation of only supporting timestamps in the limited nanosecond range 
of 1677 - 2262 
([http://pandas-docs.github.io/pandas-docs-travis/user_guide/timeseries.html#timeseries-timestamp-limits]).

We could catch the overflow error and in that case still return a 
datetime.datetime object. I personally don't really like this data-dependent 
behaviour, but we already have this pandas-available-dependent behaviour 
(alternatively, we could also always return datetime.datetime, or put the 
return of pandas Timestamps behind a keyword).
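
A minimal sketch of what such a fallback could look like (a hypothetical helper 
for a timestamp[us] value, not the actual pyarrow conversion code):

{code:python}
import datetime
import pandas as pd

def timestamp_us_to_py(value):
    # value: microseconds since the Unix epoch, as stored in a timestamp[us] array.
    try:
        return pd.Timestamp(value, unit='us')
    except (OverflowError, pd.errors.OutOfBoundsDatetime):
        # Outside the pandas nanosecond range: fall back to datetime.datetime.
        return datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=value)
{code}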

 

> [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too 
> large to convert to C long
> ---
>
> Key: ARROW-5450
> URL: https://issues.apache.org/jira/browse/ARROW-5450
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Tim Swast
>Priority: Major
>
> When I attempt to roundtrip from a list of moderately large (beyond what can 
> be represented in nanosecond precision, but within microsecond precision) 
> datetime objects to pyarrow and back, I get an OverflowError: Python int too 
> large to convert to C long.
> pyarrow version:
> {noformat}
> $ pip freeze | grep pyarrow
> pyarrow==0.13.0{noformat}
>  
> Reproduction:
> {code:java}
> import datetime
> import pandas
> import pyarrow
> import pytz
> timestamp_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> None,
> datetime.datetime(, 12, 31, 23, 59, 59, 99, tzinfo=pytz.utc),
> datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> ]
> timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", 
> tz="UTC"))
> timestamp_roundtrip = timestamp_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 timestamp_roundtrip = timestamp_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}
> For good measure, I also tested with timezone-naive timestamps with the same 
> error:
> {code:java}
> naive_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0),
> None,
> datetime.datetime(, 12, 31, 23, 59, 59, 99),
> datetime.datetime(1970, 1, 1, 0, 0, 0),
> ]
> naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None))
> naive_roundtrip = naive_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 naive_roundtrip = naive_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5104) [Python/C++] Schema for empty tables include index column as integer

2019-06-05 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5104:
-
Fix Version/s: 0.14.0

> [Python/C++] Schema for empty tables include index column as integer
> 
>
> Key: ARROW-5104
> URL: https://issues.apache.org/jira/browse/ARROW-5104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.14.0
>
>
> The schema for an empty table/dataframe still includes the index as an 
> integer column instead of being serialized solely as a metadata reference 
> (see ARROW-1639)
> In the example below, the empty dataframe still holds `__index_level_0__` as 
> an integer column. Proper behavior would be to exclude it and to reference the 
> index information in the pandas metadata, as is the case for a non-empty 
> dataframe.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: non_empty =  pd.DataFrame({"col": [1]})
> In [4]: empty = non_empty.drop(0)
> In [5]: empty
> Out[5]:
> Empty DataFrame
> Columns: [col]
> Index: []
> In [6]: pa.Table.from_pandas(non_empty)
> Out[6]:
> pyarrow.Table
> col: int64
> metadata
> 
> OrderedDict([(b'pandas',
>   b'{"index_columns": [{"kind": "range", "name": null, "start": '
>   b'0, "stop": 1, "step": 1}], "column_indexes": [{"name": null,'
>   b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
>   b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
>   b'{"name": "col", "field_name": "col", "pandas_type": "int64",'
>   b' "numpy_type": "int64", "metadata": null}], "creator": {"lib'
>   b'rary": "pyarrow", "version": "0.13.0"}, "pandas_version": nu'
>   b'll}')])
> In [7]: pa.Table.from_pandas(empty)
> Out[7]:
> pyarrow.Table
> col: int64
> __index_level_0__: int64
> metadata
> 
> OrderedDict([(b'pandas',
>   b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
>   b'{"name": null, "field_name": null, "pandas_type": "unicode",'
>   b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
>   b', "columns": [{"name": "col", "field_name": "col", "pandas_t'
>   b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
>   b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
>   b': "int64", "numpy_type": "int64", "metadata": null}], "creat'
>   b'or": {"library": "pyarrow", "version": "0.13.0"}, "pandas_ve'
>   b'rsion": null}')])
> In [8]: pa.__version__
> Out[8]: '0.13.0'
> In [9]: ! python --version
> Python 3.6.7
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

2019-06-05 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856967#comment-16856967
 ] 

Joris Van den Bossche commented on ARROW-5480:
--

[~wesmckinn] I think this can be closed as duplicate of the other issue?

> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> -
>
> Key: ARROW-5480
> URL: https://issues.apache.org/jira/browse/ARROW-5480
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
>Reporter: Karl Dunkle Werner
>Priority: Minor
>
> Writing a string categorical variable from pandas to parquet is read back as 
> a string (object dtype). I expected it to be read back as category.
> The same thing happens if the category is numeric -- a numeric category is 
> read back as int64.
> In the code below, I tried out an in-memory arrow Table, which successfully 
> translates categories back to pandas. However, when I write to a parquet 
> file, the category dtype is not preserved.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes  # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes  # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes  # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes  # int64
> {code}
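
Until this round-trips natively, a user-side workaround could be to restore the 
dtype after reading (a sketch, reusing the {{categories.parquet}} file written 
in the example above):

{code:python}
import pandas as pd

# Sketch of a workaround: re-apply the categorical dtype after reading back.
df_back = pd.read_parquet("categories.parquet")
df_back["x"] = df_back["x"].astype("category")
print(df_back.dtypes)  # x: category
{code}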



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858365#comment-16858365
 ] 

Joris Van den Bossche commented on ARROW-5450:
--

Yes, certainly given the time range limitations of pandas.Timestamp, that would 
be a good option. I am not sure to what extent we can change this, or want to 
introduce options for this though.

> [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too 
> large to convert to C long
> ---
>
> Key: ARROW-5450
> URL: https://issues.apache.org/jira/browse/ARROW-5450
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Tim Swast
>Priority: Major
>
> When I attempt to roundtrip from a list of moderately large (beyond what can 
> be represented in nanosecond precision, but within microsecond precision) 
> datetime objects to pyarrow and back, I get an OverflowError: Python int too 
> large to convert to C long.
> pyarrow version:
> {noformat}
> $ pip freeze | grep pyarrow
> pyarrow==0.13.0{noformat}
>  
> Reproduction:
> {code:java}
> import datetime
> import pandas
> import pyarrow
> import pytz
> timestamp_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> None,
> datetime.datetime(, 12, 31, 23, 59, 59, 99, tzinfo=pytz.utc),
> datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> ]
> timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", 
> tz="UTC"))
> timestamp_roundtrip = timestamp_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 timestamp_roundtrip = timestamp_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}
> For good measure, I also tested with timezone-naive timestamps with the same 
> error:
> {code:java}
> naive_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0),
> None,
> datetime.datetime(, 12, 31, 23, 59, 59, 99),
> datetime.datetime(1970, 1, 1, 0, 0, 0),
> ]
> naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None))
> naive_roundtrip = naive_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 naive_roundtrip = naive_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4350) [Python] nested numpy arrays

2019-06-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-4350:
-
Description: 
Nested numpy arrays cannot be converted to a list-of-list type array:

{code:python}
arr = np.empty(2, dtype=object)
arr[:] = [np.array([1, 2]), np.array([2, 3])]

pa.array([arr, arr])
{code}

results in

{code}
ArrowTypeError: only size-1 arrays can be converted to Python scalars
{code}

Starting from lists of lists works fine:

{code:python}
lists = [[1, 2], [2, 3]]
pa.array([lists, lists]).type
{code}

{code:none}
ListType(list<item: list<item: int64>>)
{code}

Specifying the type explicitly as {{pa.array([arr, arr], 
type=pa.list_(pa.list_(pa.int64())))}} does not help.

Due to this, a round-trip is not working, as the list of list type gives back 
an array of arrays in python:

{code:python}
In [2]: lists = [[1, 2], [2, 3]] 
   ...: a = pa.array([lists, lists])



In [3]: a.to_pandas()   


Out[3]: 
array([array([array([1, 2]), array([2, 3])], dtype=object),
   array([array([1, 2]), array([2, 3])], dtype=object)], dtype=object)

In [4]: pa.array(a.to_pandas()) 


---
ArrowTypeErrorTraceback (most recent call last)
 in 
> 1 pa.array(a.to_pandas())

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: only size-1 arrays can be converted to Python scalars
{code}





Original report:

{code:java}
In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]})

In [20]: df.iloc[0].to_dict()
Out[20]: {'a': [[1], [2]], 'b': 1}

In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()
Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1}

In [24]: np.array(df.iloc[0].to_dict()['a']).shape
Out[24]: (2, 1)

In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape
Out[25]: (2,)
{code}
Adding extra array type is not functioning as expected. 

 

More importantly, this would fail

 
{code:java}
In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': [[1, 
2],[2, 3]]})

In [109]: df
Out[109]:
a b
0 [[1, 2], [2, 3]] [1, 2]
1 [[1, 2], [2, 3]] [2, 3]

In [110]: pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
---
ArrowTypeError Traceback (most recent call last)
 in ()
> 1 pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/table.pxi
 in pyarrow.lib.Table.from_pandas()
1215 
1216 """
-> 1217 names, arrays, metadata = pdcompat.dataframe_to_arrays(
1218 df,
1219 schema=schema,

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
 in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
379 arrays = [convert_column(c, t)
380 for c, t in zip(columns_to_convert,
--> 381 convert_types)]
382 else:
383 from concurrent import futures

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
 in convert_column(col, ty)
374 e.args += ("Conversion failed for column {0!s} with type {1!s}"
375 .format(col.name, col.dtype),)
--> 376 raise e
377
378 if nthreads == 1:

ArrowTypeError: ('only size-1 arrays can be converted to Python scalars', 
'Conversion failed for column a with type object')

{code}
 

  was:
{code:java}
In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]})

In [20]: df.iloc[0].to_dict()
Out[20]: {'a': [[1], [2]], 'b': 1}

In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()
Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1}

In [24]: np.array(df.iloc[0].to_dict()['a']).shape
Out[24]: (2, 1)

In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape
Out[25]: (2,)
{code}
Adding extra array type is not functioning as expected. 

 

More importantly, this would fail

 
{code:java}
In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': [[1, 
2],[2, 3]]})

In [109]: df
Out[109]:
a b
0 [[1, 2], [2, 3]] [1, 2]
1 [[1, 2], [2, 3]] [2, 3]

In [110

[jira] [Updated] (ARROW-4350) [Python] nested numpy arrays cannot be converted to ListArray

2019-06-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-4350:
-
Summary: [Python] nested numpy arrays cannot be converted to ListArray  
(was: [Python] nested numpy arrays)

> [Python] nested numpy arrays cannot be converted to ListArray
> -
>
> Key: ARROW-4350
> URL: https://issues.apache.org/jira/browse/ARROW-4350
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
>Reporter: yu peng
>Priority: Major
> Fix For: 0.14.0
>
>
> Nested numpy arrays cannot be converted to a list-of-list type array:
> {code:python}
> arr = np.empty(2, dtype=object)
> arr[:] = [np.array([1, 2]), np.array([2, 3])]
> pa.array([arr, arr])
> {code}
> results in
> {code}
> ArrowTypeError: only size-1 arrays can be converted to Python scalars
> {code}
> Starting from lists of lists works fine:
> {code:python}
> lists = [[1, 2], [2, 3]]
> pa.array([lists, lists]).type
> {code}
> {code:none}
> ListType(list<item: list<item: int64>>)
> {code}
> Specifying the type explicitly as {{pa.array([arr, arr], 
> type=pa.list_(pa.list_(pa.int64())))}} does not help.
> Due to this, a round-trip is not working, as the list of list type gives back 
> an array of arrays in python:
> {code:python}
> In [2]: lists = [[1, 2], [2, 3]] 
>...: a = pa.array([lists, lists])  
>   
> 
> In [3]: a.to_pandas() 
>   
> 
> Out[3]: 
> array([array([array([1, 2]), array([2, 3])], dtype=object),
>array([array([1, 2]), array([2, 3])], dtype=object)], dtype=object)
> In [4]: pa.array(a.to_pandas())   
>   
> 
> ---
> ArrowTypeErrorTraceback (most recent call last)
>  in 
> > 1 pa.array(a.to_pandas())
> ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
> ~/scipy/repos/arrow/python/pyarrow/array.pxi in 
> pyarrow.lib._ndarray_to_array()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: only size-1 arrays can be converted to Python scalars
> {code}
> 
> Original report:
> {code:java}
> In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]})
> In [20]: df.iloc[0].to_dict()
> Out[20]: {'a': [[1], [2]], 'b': 1}
> In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()
> Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1}
> In [24]: np.array(df.iloc[0].to_dict()['a']).shape
> Out[24]: (2, 1)
> In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape
> Out[25]: (2,)
> {code}
> Adding extra array type is not functioning as expected. 
>  
> More importantly, this would fail
>  
> {code:java}
> In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': 
> [[1, 2],[2, 3]]})
> In [109]: df
> Out[109]:
> a b
> 0 [[1, 2], [2, 3]] [1, 2]
> 1 [[1, 2], [2, 3]] [2, 3]
> In [110]: pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
> ---
> ArrowTypeError Traceback (most recent call last)
>  in ()
> > 1 pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.Table.from_pandas()
> 1215 
> 1216 """
> -> 1217 names, arrays, metadata = pdcompat.dataframe_to_arrays(
> 1218 df,
> 1219 schema=schema,
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 379 arrays = [convert_column(c, t)
> 380 for c, t in zip(columns_to_convert,
> --> 381 convert_types)]
> 382 else:
> 383 from concurrent import futures
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in convert_column(col, ty)
> 374 e.args += ("Conversion failed for column {0!s} with type {1!s}"
> 375 .format(col.name, col.dtype),)
> --> 376 raise e
> 377
> 378 if nthreads == 1:
> ArrowTypeError: ('only size-1 arrays can be converted to Python scalars', 
> 'Conversion failed for column a with type object')
> {code}
>  



--
This message was sen

[jira] [Updated] (ARROW-4350) [Python] nested numpy arrays cannot be converted to a list-of-list ListArray

2019-06-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-4350:
-
Summary: [Python] nested numpy arrays cannot be converted to a list-of-list 
ListArray  (was: [Python] nested numpy arrays cannot be converted to ListArray)

> [Python] nested numpy arrays cannot be converted to a list-of-list ListArray
> 
>
> Key: ARROW-4350
> URL: https://issues.apache.org/jira/browse/ARROW-4350
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
>Reporter: yu peng
>Priority: Major
> Fix For: 0.14.0
>
>
> Nested numpy arrays (as the scalar value) cannot be converted to a 
> list-of-list type array:
> {code}
> arr = np.empty(2, dtype=object)
> arr[:] = [np.array([1, 2]), np.array([2, 3])]
> pa.array([arr, arr])
> {code}
> results in
> {code:java}
> ArrowTypeError: only size-1 arrays can be converted to Python scalars
> {code}
> Starting from lists of lists works fine:
> {code}
> lists = [[1, 2], [2, 3]]
> pa.array([lists, lists]).type
> {code}
> {code:none}
> ListType(list<item: list<item: int64>>)
> {code}
> Specifying the type explicitly as {{pa.array([arr, arr], 
> type=pa.list_(pa.list_(pa.int64())))}} does not help.
> Due to this, a round-trip is not working, as the list of list type gives back 
> an array of arrays in python:
> {code}
> In [2]: lists = [[1, 2], [2, 3]] 
>...: a = pa.array([lists, lists])  
>   
> 
> In [3]: a.to_pandas() 
>   
> 
> Out[3]: 
> array([array([array([1, 2]), array([2, 3])], dtype=object),
>array([array([1, 2]), array([2, 3])], dtype=object)], dtype=object)
> In [4]: pa.array(a.to_pandas())   
>   
> 
> ---
> ArrowTypeErrorTraceback (most recent call last)
>  in 
> > 1 pa.array(a.to_pandas())
> ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
> ~/scipy/repos/arrow/python/pyarrow/array.pxi in 
> pyarrow.lib._ndarray_to_array()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: only size-1 arrays can be converted to Python scalars
> {code}
> 
> Original report:
> {code:java}
> In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]})
> In [20]: df.iloc[0].to_dict()
> Out[20]: {'a': [[1], [2]], 'b': 1}
> In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()
> Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1}
> In [24]: np.array(df.iloc[0].to_dict()['a']).shape
> Out[24]: (2, 1)
> In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape
> Out[25]: (2,)
> {code}
> Adding extra array type is not functioning as expected. 
>  
> More importantly, this would fail
>  
> {code:java}
> In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': 
> [[1, 2],[2, 3]]})
> In [109]: df
> Out[109]:
> a b
> 0 [[1, 2], [2, 3]] [1, 2]
> 1 [[1, 2], [2, 3]] [2, 3]
> In [110]: pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
> ---
> ArrowTypeError Traceback (most recent call last)
>  in ()
> > 1 pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.Table.from_pandas()
> 1215 
> 1216 """
> -> 1217 names, arrays, metadata = pdcompat.dataframe_to_arrays(
> 1218 df,
> 1219 schema=schema,
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 379 arrays = [convert_column(c, t)
> 380 for c, t in zip(columns_to_convert,
> --> 381 convert_types)]
> 382 else:
> 383 from concurrent import futures
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in convert_column(col, ty)
> 374 e.args += ("Conversion failed for column {0!s} with type {1!s}"
> 375 .format(col.name, col.dtype),)
> --> 376 raise e
> 377
> 378 if nthreads == 1:
> ArrowTypeError: ('only size-1 arrays can be converted to Python scalars', 
> 'Co

[jira] [Updated] (ARROW-4350) [Python] nested numpy arrays cannot be converted to ListArray

2019-06-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-4350:
-
Description: 
Nested numpy arrays (as the scalar value) cannot be converted to a list-of-list 
type array:
{code}
arr = np.empty(2, dtype=object)
arr[:] = [np.array([1, 2]), np.array([2, 3])]

pa.array([arr, arr])
{code}
results in
{code:java}
ArrowTypeError: only size-1 arrays can be converted to Python scalars
{code}
Starting from lists of lists works fine:
{code}
lists = [[1, 2], [2, 3]]
pa.array([lists, lists]).type
{code}
{code:none}
ListType(list<item: list<item: int64>>)
{code}
Specifying the type explicitly as {{pa.array([arr, arr], 
type=pa.list_(pa.list_(pa.int64())))}} does not help.

Due to this, a round-trip is not working, as the list of list type gives back 
an array of arrays in python:
{code}
In [2]: lists = [[1, 2], [2, 3]] 
   ...: a = pa.array([lists, lists])



In [3]: a.to_pandas()   


Out[3]: 
array([array([array([1, 2]), array([2, 3])], dtype=object),
   array([array([1, 2]), array([2, 3])], dtype=object)], dtype=object)

In [4]: pa.array(a.to_pandas()) 


---
ArrowTypeErrorTraceback (most recent call last)
 in 
> 1 pa.array(a.to_pandas())

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: only size-1 arrays can be converted to Python scalars
{code}

Original report:
{code:java}
In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]})

In [20]: df.iloc[0].to_dict()
Out[20]: {'a': [[1], [2]], 'b': 1}

In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()
Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1}

In [24]: np.array(df.iloc[0].to_dict()['a']).shape
Out[24]: (2, 1)

In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape
Out[25]: (2,)
{code}
Adding extra array type is not functioning as expected. 

 

More importantly, this would fail

 
{code:java}
In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': [[1, 
2],[2, 3]]})

In [109]: df
Out[109]:
a b
0 [[1, 2], [2, 3]] [1, 2]
1 [[1, 2], [2, 3]] [2, 3]

In [110]: pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
---
ArrowTypeError Traceback (most recent call last)
 in ()
> 1 pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/table.pxi
 in pyarrow.lib.Table.from_pandas()
1215 
1216 """
-> 1217 names, arrays, metadata = pdcompat.dataframe_to_arrays(
1218 df,
1219 schema=schema,

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
 in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
379 arrays = [convert_column(c, t)
380 for c, t in zip(columns_to_convert,
--> 381 convert_types)]
382 else:
383 from concurrent import futures

/Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
 in convert_column(col, ty)
374 e.args += ("Conversion failed for column {0!s} with type {1!s}"
375 .format(col.name, col.dtype),)
--> 376 raise e
377
378 if nthreads == 1:

ArrowTypeError: ('only size-1 arrays can be converted to Python scalars', 
'Conversion failed for column a with type object')

{code}
 

  was:
Nested numpy arrays cannot be converted to a list-of-list type array:

{code:python}
arr = np.empty(2, dtype=object)
arr[:] = [np.array([1, 2]), np.array([2, 3])]

pa.array([arr, arr])
{code}

results in

{code}
ArrowTypeError: only size-1 arrays can be converted to Python scalars
{code}

Starting from lists of lists works fine:

{code:python}
lists = [[1, 2], [2, 3]]
pa.array([lists, lists]).type
{code}

{code:none}
ListType(list<item: list<item: int64>>)
{code}

Specifying the type explicitly as {{pa.array([arr, arr], 
type=pa.list_(pa.list_(pa.int64())))}} does not help.

Due to this, a round-trip is not working, as the list of list type gives back 
an array of arrays in python:

{code:python}
In [2]: lists = [[1, 2], [2, 3]] 
   ...: a = pa.array([lists, list

[jira] [Commented] (ARROW-4350) [Python] nested numpy arrays cannot be converted to a list-of-list ListArray

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858438#comment-16858438
 ] 

Joris Van den Bossche commented on ARROW-4350:
--

Updated the title and top post with additional explanation.

The main problem of this limitation is that a roundtrip is not possible (a 
ListArray with nested lists results in arrays-of-arrays when converting to 
pandas, but arrays-of-arrays cannot be converted back to a nested ListArray).

For subarrays of length 1 (the first example of the original report), the 
roundtrip "seems" to work, but actually gives a wrong result:

{code:python}
In [5]: arr = [[1], [2]] 
   ...: a = pa.array([arr, arr])



In [6]: a.to_pandas()   


Out[6]: 
array([array([array([1]), array([2])], dtype=object),
   array([array([1]), array([2])], dtype=object)], dtype=object)

In [7]: a.to_pylist()   


Out[7]: [[[1], [2]], [[1], [2]]]

In [8]: pa.array(a.to_pandas()) 


Out[8]: 

[
  [
[
  1,
  2
]
  ],
  [
[
  1,
  2
]
  ]
]

In [9]: pa.array(a.to_pylist()) 


Out[9]: 

[
  [
[
  1
],
[
  2
]
  ],
  [
[
  1
],
[
  2
]
  ]
]
{code}

So both give the same type, but the array-of-arrays input incorrectly gives a 
"list of 1 list of length 2" as the scalar value instead of a "list of 2 lists 
of length 1".

> [Python] nested numpy arrays cannot be converted to a list-of-list ListArray
> 
>
> Key: ARROW-4350
> URL: https://issues.apache.org/jira/browse/ARROW-4350
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
>Reporter: yu peng
>Priority: Major
> Fix For: 0.14.0
>
>
> Nested numpy arrays (as the scalar value) cannot be converted to a 
> list-of-list type array:
> {code}
> arr = np.empty(2, dtype=object)
> arr[:] = [np.array([1, 2]), np.array([2, 3])]
> pa.array([arr, arr])
> {code}
> results in
> {code:java}
> ArrowTypeError: only size-1 arrays can be converted to Python scalars
> {code}
> Starting from lists of lists works fine:
> {code}
> lists = [[1, 2], [2, 3]]
> pa.array([lists, lists]).type
> {code}
> {code:none}
> ListType(list<item: list<item: int64>>)
> {code}
> Specifying the type explicitly as {{pa.array([arr, arr], 
> type=pa.list_(pa.list_(pa.int64())))}} does not help.
> Due to this, a round-trip is not working, as the list of list type gives back 
> an array of arrays in python:
> {code}
> In [2]: lists = [[1, 2], [2, 3]] 
>...: a = pa.array([lists, lists])  
>   
> 
> In [3]: a.to_pandas() 
>   
> 
> Out[3]: 
> array([array([array([1, 2]), array([2, 3])], dtype=object),
>array([array([1, 2]), array([2, 3])], dtype=object)], dtype=object)
> In [4]: pa.array(a.to_pandas())   
>   
> 
> ---
> ArrowTypeErrorTraceback (most recent call last)
>  in 
> > 1 pa.array(a.to_pandas())
> ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
> ~/scipy/repos/arrow/python/pyarrow/array.pxi in 
> pyarrow.lib._ndarray_to_array()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: only size-1 arrays can be converted to Python scalars
> {code}
> 
> Original report:
> {code:java}
> In [19]: df = pd.DataFrame(

[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858460#comment-16858460
 ] 

Joris Van den Bossche commented on ARROW-3801:
--

[~buhrmann] do you know which version of pandas you were using? 

For me, with the combinations of pandas+arrow master or pandas 0.24.2 + 
arrow 0.12.1, this works fine (the reordered categorical's categories get 
turned into a writable numpy array).

There have been improvements in pandas to deal with read-only arrays related to 
hashtables, such as https://github.com/pandas-dev/pandas/pull/18825 and 
https://github.com/pandas-dev/pandas/pull/21688, so those might have fixed it.

> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> 
>
> Key: ARROW-3801
> URL: https://issues.apache.org/jira/browse/ARROW-3801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.10.0
>Reporter: Thomas Buhrmann
>Priority: Major
> Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will 
> make the categorical index non-writeable, which in turn trips up pandas when 
> e.g. reordering the categories, raising "ValueError: buffer source array is 
> read-only" :
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
> Outputs:
>  
> {code:java}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---
> ValueError Traceback (most recent call last)
>  in 
>  12 print("DType after:", repr(df2.c1.dtype))
>  13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
>  15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable

2019-06-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-3801.
--
Resolution: Works for Me

I am going to close this issue, as I think it is fixed in the latest versions. 
But [~buhrmann], it would be good if you could also check that it is fixed for 
you. We can always reopen if that doesn't seem to be the case.

> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> 
>
> Key: ARROW-3801
> URL: https://issues.apache.org/jira/browse/ARROW-3801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.10.0
>Reporter: Thomas Buhrmann
>Priority: Major
> Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will 
> make the categorical index non-writeable, which in turn trips up pandas when 
> e.g. reordering the categories, raising "ValueError: buffer source array is 
> read-only" :
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
> Outputs:
>  
> {code:java}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---
> ValueError Traceback (most recent call last)
>  in 
>  12 print("DType after:", repr(df2.c1.dtype))
>  13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
>  15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2298) [Python] Add option to not consider NaN to be null when converting to an integer Arrow type

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858547#comment-16858547
 ] 

Joris Van den Bossche commented on ARROW-2298:
--

[~farnoy] For me, the example you show above works:
{code}
In [33]: schema = pa.schema([pa.field(name='a', type=pa.int64(), 
nullable=True)])


In [34]: pa.Table.from_pandas(df, schema=schema, preserve_index=False)  

   
Out[34]: 
pyarrow.Table
a: int64
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "a", "field_name": "a", "pandas_type": "int64", "numpy_type": "'
b'float64", "metadata": null}], "creator": {"library": "pyarrow", '
b'"version": "0.13.1.dev313+g997226a9"}, "pandas_version": "0.24.2'
b'"}'}

In [35]: table = _  



In [36]: table.column('a')  


Out[36]: 

[
  [
null,
1,
2,
3,
null
  ]
]
{code}

This is because in {{Table.from_pandas}} we assume the data is coming from 
pandas and allow the above. 

Using just the array API, you can see the difference (converting a float numpy 
array to an integer arrow array):

{code:python}
In [41]: pa.array(np.array([1, 2, np.nan], dtype=float), type=pa.int64())   


...
ArrowInvalid: Floating point value truncated

In [42]: pa.array(np.array([1, 2, np.nan], dtype=float), type=pa.int64(), 
from_pandas=True)   
  
Out[42]: 

[
  1,
  2,
  null
]
{code}

Does that satisfy your use case? 

It might not help for very big integers that cannot be represented exactly 
as floats (those will still raise an error about values being truncated), but 
if you are coming from pandas, that use case will not be very frequent, exactly 
because pandas cannot properly represent such values itself.
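
To illustrate why such large integers are problematic once they pass through a 
float column (a small sketch):

{code:python}
# float64 has only 53 bits of mantissa, so integers this large cannot be
# represented exactly; round-tripping through float changes the value.
x = 2**63 - 1
print(float(x) == x)       # False
print(int(float(x)) - x)   # 1 (the nearest representable float is one higher)
{code}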

> [Python] Add option to not consider NaN to be null when converting to an 
> integer Arrow type
> ---
>
> Key: ARROW-2298
> URL: https://issues.apache.org/jira/browse/ARROW-2298
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Follow-on work to ARROW-2135



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858552#comment-16858552
 ] 

Joris Van den Bossche commented on ARROW-3801:
--

I am not yet too familiar with the logic behind the conversions from arrow to 
python, but I want to note that plain array conversion also gives a read-only 
numpy array:

{code:python}
In [53]: a = pa.array([1, 2, 3])



In [54]: a.to_pandas()  


Out[54]: array([1, 2, 3])

In [55]: a.to_pandas().flags.writeable  


Out[55]: False
{code}

So this is in any case not specific to categoricals (DictionaryArray). For 
example, the codes of the categorical are also read-only.
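
For completeness, a small user-side sketch showing the flag and the usual way 
around it (taking an explicit copy):

{code:python}
import pyarrow as pa

arr = pa.array([1, 2, 3]).to_pandas()
print(arr.flags.writeable)         # False: backed by Arrow memory, as shown above
print(arr.copy().flags.writeable)  # True: an explicit copy is writable again
{code}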

> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> 
>
> Key: ARROW-3801
> URL: https://issues.apache.org/jira/browse/ARROW-3801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.10.0
>Reporter: Thomas Buhrmann
>Priority: Major
> Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will 
> make the categorical index non-writeable, which in turn trips up pandas when 
> e.g. reordering the categories, raising "ValueError: buffer source array is 
> read-only" :
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
> Outputs:
>  
> {code:java}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---
> ValueError Traceback (most recent call last)
>  in 
>  12 print("DType after:", repr(df2.c1.dtype))
>  13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
>  15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2818) [Python] Better error message when passing SparseDataFrame into Table.from_pandas

2019-06-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-2818:


Assignee: Joris Van den Bossche

> [Python] Better error message when passing SparseDataFrame into 
> Table.from_pandas
> -
>
> Key: ARROW-2818
> URL: https://issues.apache.org/jira/browse/ARROW-2818
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> This can be a rough edge for users. Note that pandas sparse support is being 
> considered for deprecation
> original issue https://github.com/apache/arrow/issues/1894



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2037) [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858688#comment-16858688
 ] 

Joris Van den Bossche commented on ARROW-2037:
--

Not fully sure what is left to do here. The referenced issue ARROW-1941 was 
closed by https://github.com/apache/arrow/pull/1449, which included a test for 
roundtrip of columns with empty lists.

Unless it is not related to that issue? Because the example of ARROW-1941 
(column of empty lists) does not give an inferred type of 'empty' but 'mixed':

{code:python}
In [22]: empty_list_array = np.empty((3,), dtype=object) 
...: empty_list_array.fill([]) 
...:  
...: df = pd.DataFrame({'a': np.array(['1', '2', '3']), 
...:'b': empty_list_array}) 



In [23]: df 


Out[23]: 
   a   b
0  1  []
1  2  []
2  3  []

In [25]: pd.api.types.infer_dtype(df['b'], skipna=True) 


Out[25]: 'mixed'
{code}

> [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'
> --
>
> Key: ARROW-2037
> URL: https://issues.apache.org/jira/browse/ARROW-2037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2037) [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858692#comment-16858692
 ] 

Joris Van den Bossche commented on ARROW-2037:
--

You get an "empty" inferred type if you have an object dtype with no rows or 
only missing values:

{code:python}
In [43]: s = pd.Series([], dtype=object)

    

In [44]: pd.api.types.infer_dtype(s, skipna=True)   

    
Out[44]: 'empty'

In [45]: s = pd.Series([None, None], dtype=object)  

    

In [46]: pd.api.types.infer_dtype(s, skipna=True)   

    
Out[46]: 'empty'
{code}

Not sure if that is what was meant, but the roundtrip for that is currently 
working, so we can add a test for that.

 

> [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'
> --
>
> Key: ARROW-2037
> URL: https://issues.apache.org/jira/browse/ARROW-2037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2037) [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858692#comment-16858692
 ] 

Joris Van den Bossche edited comment on ARROW-2037 at 6/7/19 2:19 PM:
--

You get an "empty" inferred type if you have an object dtype with no rows or 
only missing values:

{code:python}
In [43]: s = pd.Series([], dtype=object)
     

In [44]: pd.api.types.infer_dtype(s, skipna=True)     
Out[44]: 'empty'

In [45]: s = pd.Series([None, None], dtype=object) 

In [46]: pd.api.types.infer_dtype(s, skipna=True)  
Out[46]: 'empty'
{code}

Not sure if that is what was meant, but the roundtrip for that is currently 
working, so we can add a test for that.

 


was (Author: jorisvandenbossche):
You get an "empty" inferred type if you have an object dtype with no rows or 
only missing values:

{code:python}
In [43]: s = pd.Series([], dtype=object)

    

In [44]: pd.api.types.infer_dtype(s, skipna=True)   

    
Out[44]: 'empty'

In [45]: s = pd.Series([None, None], dtype=object)  

    

In [46]: pd.api.types.infer_dtype(s, skipna=True)   

    
Out[46]: 'empty'
{code}

Not sure if that is what was meant, but the roundtrip for that is currently 
working, so can add a test for that.

 

> [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'
> --
>
> Key: ARROW-2037
> URL: https://issues.apache.org/jira/browse/ARROW-2037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2037) [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858694#comment-16858694
 ] 

Joris Van den Bossche commented on ARROW-2037:
--

And that case is already tested here: 
https://github.com/apache/arrow/blob/997226a9263430bc0422189180bc2551aed1f63d/python/pyarrow/tests/test_pandas.py#L2093-L2095

So closing this issue.

> [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'
> --
>
> Key: ARROW-2037
> URL: https://issues.apache.org/jira/browse/ARROW-2037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2037) [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'

2019-06-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-2037:
-
Fix Version/s: (was: 0.14.0)

> [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'
> --
>
> Key: ARROW-2037
> URL: https://issues.apache.org/jira/browse/ARROW-2037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2037) [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'

2019-06-07 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-2037.

Resolution: Invalid

> [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'
> --
>
> Key: ARROW-2037
> URL: https://issues.apache.org/jira/browse/ARROW-2037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1989) [Python] Better UX on timestamp conversion to Pandas

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858724#comment-16858724
 ] 

Joris Van den Bossche commented on ARROW-1989:
--

Looking into this, but I can't find a reproducible example which gives a 
similar error to what is reported above. Does somebody have a concrete example?

With the latest pandas and pyarrow (and the same with pd 0.24.2 / pyarrow 0.12), 
I can get to something like this (a timestamp with a lower resolution that is 
out of bounds for pandas):

{code:python}
In [63]: a = pa.array([datetime.datetime(1018, 12, 12)], type=pa.timestamp('s'))

In [64]: a.to_pandas()
Out[64]: array(['1018-12-12T00:00:00'], dtype='datetime64[s]')

In [65]: table = pa.Table.from_pydict({'a': a})

In [66]: table
Out[66]: 
pyarrow.Table
a: timestamp[s]

In [67]: table.to_pandas()
Out[67]: 
  a
0 2188-01-19 23:09:07.419103232
{code}

This is a wrong result, however, and it happens silently. This is a bug in 
pandas, described in https://issues.apache.org/jira/browse/ARROW-3176

> [Python] Better UX on timestamp conversion to Pandas
> 
>
> Key: ARROW-1989
> URL: https://issues.apache.org/jira/browse/ARROW-1989
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> Converting timestamp columns to Pandas, users often have the problem that 
> they have dates that are larger than Pandas can represent with their 
> nanosecond representation. Currently they simply see an Arrow exception and 
> think that this problem is caused by Arrow. We should try to change the error 
> from
> {code}
> ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: XX
> {code}
> to something along the lines of 
> {code}
> ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> XX. This conversion is needed as Pandas does only support nanosecond 
> timestamps. Your data is likely out of the range that can be represented with 
> nanosecond resolution.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-1989) [Python] Better UX on timestamp conversion to Pandas

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858724#comment-16858724
 ] 

Joris Van den Bossche edited comment on ARROW-1989 at 6/7/19 3:00 PM:
--

Looking into this, but I can't find a reproducible example which gives a 
similar error to what is reported above. Does somebody have a concrete example?

With the latest pandas and pyarrow (and the same with pd 0.24.2 / pyarrow 0.12), 
I can get to something like this (a timestamp with a lower resolution that is 
out of bounds for pandas):

{code:python}
In [63]: a = pa.array([datetime.datetime(1018, 12, 12)], type=pa.timestamp('s'))

In [64]: a.to_pandas()
Out[64]: array(['1018-12-12T00:00:00'], dtype='datetime64[s]')

In [65]: table = pa.Table.from_pydict({'a': a})

In [66]: table
Out[66]: 
pyarrow.Table
a: timestamp[s]

In [67]: table.to_pandas()
Out[67]: 
  a
0 2188-01-19 23:09:07.419103232
{code}

This is a wrong result, however, and it happens silently. This is a bug in 
pandas, described in ARROW-3176


was (Author: jorisvandenbossche):
Looking into this, but I can't find a reproducible example which gives a 
similar error to what is reported above. Does somebody have a concrete example?

With the latest pandas and pyarrow (and the same with pd 0.24.2 / pyarrow 0.12), 
I can get to something like this (a timestamp with a lower resolution that is 
out of bounds for pandas):

{code:python}
In [63]: a = pa.array([datetime.datetime(1018, 12, 12)], type=pa.timestamp('s'))

In [64]: a.to_pandas()
Out[64]: array(['1018-12-12T00:00:00'], dtype='datetime64[s]')

In [65]: table = pa.Table.from_pydict({'a': a})

In [66]: table
Out[66]: 
pyarrow.Table
a: timestamp[s]

In [67]: table.to_pandas()
Out[67]: 
  a
0 2188-01-19 23:09:07.419103232
{code}

This is a wrong result, however, and it happens silently. This is a bug in 
pandas, described in https://issues.apache.org/jira/browse/ARROW-3176

> [Python] Better UX on timestamp conversion to Pandas
> 
>
> Key: ARROW-1989
> URL: https://issues.apache.org/jira/browse/ARROW-1989
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> Converting timestamp columns to Pandas, users often have the problem that 
> they have dates that are larger than Pandas can represent with their 
> nanosecond representation. Currently they simply see an Arrow exception and 
> think that this problem is caused by Arrow. We should try to change the error 
> from
> {code}
> ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: XX
> {code}
> to something along the lines of 
> {code}
> ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> XX. This conversion is needed as Pandas does only support nanosecond 
> timestamps. Your data is likely out of the range that can be represented with 
> nanosecond resolution.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1989) [Python] Better UX on timestamp conversion to Pandas

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858736#comment-16858736
 ] 

Joris Van den Bossche commented on ARROW-1989:
--

The mention of {{allow_truncated_timestamps=True}} led me towards parquet, 
and with that I can indeed reproduce it (although for the case I reproduced 
below, it is about converting _from_ pandas and not _to_ pandas):

{code:python}
In [85]: df = pd.DataFrame({'a': [pd.Timestamp("2019-01-01 
09:10:15.123456789")]})

In [86]: table = pa.Table.from_pandas(df)

In [88]: pq.write_table(table, '__test_datetime_highprecision.parquet')
...
ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
1546333815123456789

In [89]: pq.write_table(table, '__test_datetime_highprecision.parquet', 
allow_truncated_timestamps=True)

In [91]: pq.read_table('__test_datetime_highprecision.parquet').to_pandas()
Out[91]: 
   a
0 2019-01-01 09:10:15.123456
{code}

So indeed, in this case it would be nice to have a better error message that 
also points to this option.

However, for this specific case: shouldn't we be able to solve it now that we 
have NANOS support in Parquet writing? (see 
https://issues.apache.org/jira/browse/ARROW-1957, which should be possible now 
that the LogicalTypes PR is merged: PARQUET-1411)

In general though, there will be other cases where it could be useful to 
augment the arrow error message in Python. 
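As a rough illustration of what augmenting the error message on the Python side 
could look like (only a sketch with made-up names, not the actual 
implementation):

{code:python}
import pyarrow as pa

def call_with_timestamp_hint(func, *args, **kwargs):
    # wrap a conversion call and re-raise the Arrow error with a pandas hint
    try:
        return func(*args, **kwargs)
    except pa.lib.ArrowInvalid as exc:
        if "Casting from timestamp" in str(exc) and "would lose data" in str(exc):
            raise pa.lib.ArrowInvalid(
                str(exc) + ". Pandas only supports nanosecond timestamps, "
                "while Parquet currently stores a coarser resolution; see e.g. "
                "the allow_truncated_timestamps option of write_table.")
        raise
{code}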

> [Python] Better UX on timestamp conversion to Pandas
> 
>
> Key: ARROW-1989
> URL: https://issues.apache.org/jira/browse/ARROW-1989
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> Converting timestamp columns to Pandas, users often have the problem that 
> they have dates that are larger than Pandas can represent with their 
> nanosecond representation. Currently they simply see an Arrow exception and 
> think that this problem is caused by Arrow. We should try to change the error 
> from
> {code}
> ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: XX
> {code}
> to something along the lines of 
> {code}
> ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> XX. This conversion is needed as Pandas does only support nanosecond 
> timestamps. Your data is likely out of the range that can be represented with 
> nanosecond resolution.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable

2019-06-07 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858756#comment-16858756
 ] 

Joris Van den Bossche commented on ARROW-3801:
--

In general, or only for this specific case?

> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> 
>
> Key: ARROW-3801
> URL: https://issues.apache.org/jira/browse/ARROW-3801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.10.0
>Reporter: Thomas Buhrmann
>Priority: Major
> Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will 
> make the categorical index non-writeable, which in turn trips up pandas when 
> e.g. reordering the categories, raising "ValueError: buffer source array is 
> read-only" :
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
> Outputs:
>  
> {code:java}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---
> ValueError Traceback (most recent call last)
>  in 
>  12 print("DType after:", repr(df2.c1.dtype))
>  13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
>  15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2136) [Python] Non-nullable schema fields not checked in conversions from pandas

2019-06-11 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861474#comment-16861474
 ] 

Joris Van den Bossche commented on ARROW-2136:
--

I have a PR for ARROW-5169 (https://github.com/apache/arrow/pull/4397), which 
tries to use the nullability of the passed {{schema}}, but I should check how 
my PR interacts with passing data that contains nulls.
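A minimal sketch of the kind of check this would need (purely illustrative; the 
actual fix belongs in the pandas conversion path):

{code:python}
import pyarrow as pa

def check_non_nullable_fields(table):
    # raise if a field declared non-nullable actually contains nulls
    for i, field in enumerate(table.schema):
        null_count = table.column(i).null_count
        if not field.nullable and null_count > 0:
            raise ValueError(
                "Field {!r} is declared non-nullable but contains {} "
                "null value(s)".format(field.name, null_count))
{code}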

> [Python] Non-nullable schema fields not checked in conversions from pandas
> --
>
> Key: ARROW-2136
> URL: https://issues.apache.org/jira/browse/ARROW-2136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> If you provide a schema with {{nullable=False}} but pass a {{DataFrame}} 
> which in fact has nulls it appears the schema is ignored? I would expect an 
> error here.
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1.2, 2.1, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.float64(), nullable=False)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1.2,
>   2.1,
>   NA
> ]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5514) [C++] Printer for uint64 shows wrong values

2019-06-11 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861504#comment-16861504
 ] 

Joris Van den Bossche commented on ARROW-5514:
--

Sorry for the slow reply (and thanks for the hint that this might be a good 
"easy" C++ issue). I put it on my list of possible issues to tackle, but I have 
some others I want to do first.

So if somebody else wants to take this up, feel free to do so!

> [C++] Printer for uint64 shows wrong values
> ---
>
> Key: ARROW-5514
> URL: https://issues.apache.org/jira/browse/ARROW-5514
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> From the example in ARROW-5430:
> {code}
> In [16]: pa.array([14989096668145380166, 15869664087396458664], 
> type=pa.uint64()) 
>   
> Out[16]: 
> 
> [
>   -3457647405564171450,
>   -2577079986313092952
> ]
> {code}
> I _think_ the actual conversion is correct, and it's only the printer that is 
> going wrong, as {{to_numpy}} gives the correct values:
> {code}
> In [17]: pa.array([14989096668145380166, 15869664087396458664], 
> type=pa.uint64()).to_numpy()  
>   
> Out[17]: array([14989096668145380166, 15869664087396458664], dtype=uint64)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-840) [Python] Provide Python API for creating user-defined data types that can survive Arrow IPC

2019-06-11 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861515#comment-16861515
 ] 

Joris Van den Bossche commented on ARROW-840:
-

So the first bullet point (enabling "defining extension types in Python") 
requires implementing a C++ PythonExtensionType that can translate Python 
function callbacks to the actual ExtensionType methods?

I looked into that a bit some time ago, and I think it is above my current C++ 
skill level (at least to start it). [~pitrou] is that something that you might 
want to look at? 
Once the basics are there, I am very much interested in helping further with 
this and doing the work needed to enable pandas ExtensionArray interaction.
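To make the goal concrete, a purely hypothetical sketch of what a Python-defined 
extension type could look like once such a bridge exists (none of these names 
are an existing API; the eventual design may differ):

{code:python}
import pyarrow as pa

# hypothetical subclass-based API: the user supplies the storage type,
# a name, and "get state" / "set state" serialization hooks
class UuidType(pa.ExtensionType):                     # assumed base class
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):                # "get state" -> bytes
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):  # "set state"
        return cls()
{code}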

> [Python] Provide Python API for creating user-defined data types that can 
> survive Arrow IPC
> ---
>
> Key: ARROW-840
> URL: https://issues.apache.org/jira/browse/ARROW-840
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> The user will provide:
> * Data type subclass that can indicate the physical storage type
> * "get state" and "set state" functions for serializing custom metadata to 
> bytes
> * An optional function for "boxing" scalar values from the physical array 
> storage
> Internally, this will build on an analogous C++ API for defining user data 
> types



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5568) [Python] Allow parsing more general JSON formats

2019-06-11 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861790#comment-16861790
 ] 

Joris Van den Bossche commented on ARROW-5568:
--

{quote}I have JSON data where the columnar (line-delimited) part is in a `data` 
subkey:{quote}

Note that the {{data}} subpart is not line-delimited, but a comma-delimited 
JSON array. So that's a first thing that would be good to support.

Some additional resources that might be useful: pandas supports many formats, 
called "orients", see the overview table at 
http://pandas.pydata.org/pandas-docs/version/0.24/user_guide/io.html#reading-json
 (disclaimer: I don't know how common the different formats are, so it doesn't 
necessarily make sense to copy them all from pandas).

One of the formats is the JSON Table Schema 
(https://frictionlessdata.io/specs/table-schema/), which is a json file with 
{{'metadata'}} and {{'data'}} top-level keys, where the {{'data'}} then 
consists of comma-delimited records (so very similar in structure to what 
[~dhirschfeld] showed above).
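In the meantime, a possible workaround is to split the document in Python and 
only hand the records to Arrow (a sketch assuming the structure shown in the 
issue; "block1.json" is a placeholder file name):

{code:python}
import io
import json
import pyarrow.json as pa_json   # currently only reads line-delimited JSON

with open("block1.json") as f:
    doc = json.load(f)

metadata = doc["metadata"]       # kept as a plain Python dict

# turn the comma-delimited records into line-delimited JSON for the reader
lines = "\n".join(json.dumps(record) for record in doc["data"])
table = pa_json.read_json(io.BytesIO(lines.encode("utf8")))
{code}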

> [Python] Allow parsing more general JSON formats
> 
>
> Key: ARROW-5568
> URL: https://issues.apache.org/jira/browse/ARROW-5568
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Dave Hirschfeld
>Priority: Minor
>
> I have JSON data where the columnar (line-delimited) part is in a `data` 
> subkey:
> {code:java}
> {
>   "metadata": {"name": "block1"},
>   "data" : [
> {"a": 1, "b": 2.0, "c": "foo", "d": false},
> {"a": 4, "b": -5.5, "c": null, "d": true}
>   ]
> }
> {code}
>  
>  
> It would be good if the arrow JSON parser could allow specifying where the 
> columnar data is stored.
> Since the `metadata` is also important to me it would be even better if the 
> rest of the JSON could be returned as a Python dict with the only the 
> specified keys parsed as arrow tables - e.g.
>  
> {code:java}
> >>> block1 = json.read_json(fn, tables=['data'])
> >>> block1['data']
> pyarrow.Table
> a: int64
> b: double
> c: string
> d: bool
> >>> block1['metadata']
> {'name': 'block1'}
> >>> block1
> {
>   "metadata": {"name": "block1"},
>   "data" : pyarrow.Table
> }{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-5424) [Doc] [Python] Add docs for JSON reader

2019-06-12 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-5424.

Resolution: Duplicate

> [Doc] [Python] Add docs for JSON reader
> ---
>
> Key: ARROW-5424
> URL: https://issues.apache.org/jira/browse/ARROW-5424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> Similar to the [CSV reader 
> docs|https://arrow.apache.org/docs/python/csv.html], we should add docs for 
> the Python JSON reader.
> Also add the corresponding API docs 
> (https://arrow.apache.org/docs/python/api.html).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5562) pyarrow parquet writer does not handle negative zero correctly

2019-06-12 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5562:
-
Labels: parquet  (was: )

> pyarrow parquet writer does not handle negative zero correctly
> --
>
> Key: ARROW-5562
> URL: https://issues.apache.org/jira/browse/ARROW-5562
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Bob Briody
>Priority: Major
>  Labels: parquet
>
>  
> I have the following csv file (note that col_a contains a negative zero 
> value):
> {code:java}
> col_a,col_b
> 0.0,0.0
> -0.0,0.0{code}
> ...and process it via:
> {code:java}
> from pyarrow import csv, parquet
> in_csv = 'in.csv'
> table = csv.read_csv(in_csv)
> parquet.write_to_dataset(table, root_path='./'){code}
>  
> The output parquet file is then loaded into S3 and queried via AWS Athena 
> (i.e. PrestoDB / Hive). 
> Any query that touches {{col_a}} fails with the following error:
> {code:java}
> HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split {{REDACTED}} (offset=0, 
> length=593): low must be less than or equal to high{code}
>  
> As a sanity check, I transformed the csv file to parquet using an AWS Glue 
> Spark Job and I was able to query the output parquet file successfully.
> As such, it appears as though the pyarrow writer is producing an invalid 
> parquet file when a column contains both 0.0 and -0.0, and only 0.0 and -0.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5562) pyarrow parquet writer does not handle negative zero correctly

2019-06-12 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5562:
-
Component/s: C++

> pyarrow parquet writer does not handle negative zero correctly
> --
>
> Key: ARROW-5562
> URL: https://issues.apache.org/jira/browse/ARROW-5562
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Bob Briody
>Priority: Major
>  Labels: parquet
>
>  
> I have the following csv file (note that col_a contains a negative zero 
> value):
> {code:java}
> col_a,col_b
> 0.0,0.0
> -0.0,0.0{code}
> ...and process it via:
> {code:java}
> from pyarrow import csv, parquet
> in_csv = 'in.csv'
> table = csv.read_csv(in_csv)
> parquet.write_to_dataset(table, root_path='./'){code}
>  
> The output parquet file is then loaded into S3 and queried via AWS Athena 
> (i.e. PrestoDB / Hive). 
> Any query that touches {{col_a}} fails with the following error:
> {code:java}
> HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split {{REDACTED}} (offset=0, 
> length=593): low must be less than or equal to high{code}
>  
> As a sanity check, I transformed the csv file to parquet using an AWS Glue 
> Spark Job and I was able to query the output parquet file successfully.
> As such, it appears as though the pyarrow writer is producing an invalid 
> parquet file when a column contains both 0.0 and -0.0, and only 0.0 and -0.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5562) [C++] parquet writer does not handle negative zero correctly

2019-06-12 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5562:
-
Summary: [C++] parquet writer does not handle negative zero correctly  
(was: pyarrow parquet writer does not handle negative zero correctly)

> [C++] parquet writer does not handle negative zero correctly
> 
>
> Key: ARROW-5562
> URL: https://issues.apache.org/jira/browse/ARROW-5562
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Bob Briody
>Priority: Major
>  Labels: parquet
>
>  
> I have the following csv file (note that col_a contains a negative zero 
> value):
> {code:java}
> col_a,col_b
> 0.0,0.0
> -0.0,0.0{code}
> ...and process it via:
> {code:java}
> from pyarrow import csv, parquet
> in_csv = 'in.csv'
> table = csv.read_csv(in_csv)
> parquet.write_to_dataset(table, root_path='./'){code}
>  
> The output parquet file is then loaded into S3 and queried via AWS Athena 
> (i.e. PrestoDB / Hive). 
> Any query that touches {{col_a}} fails with the following error:
> {code:java}
> HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split {{REDACTED}} (offset=0, 
> length=593): low must be less than or equal to high{code}
>  
> As a sanity check, I transformed the csv file to parquet using an AWS Glue 
> Spark Job and I was able to query the output parquet file successfully.
> As such, it appears as though the pyarrow writer is producing an invalid 
> parquet file when a column contains both 0.0 and -0.0, and only 0.0 and -0.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5540) [Python] pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string

2019-06-12 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861965#comment-16861965
 ] 

Joris Van den Bossche commented on ARROW-5540:
--

[~Koojav] Thanks for the report.

Going from the output, I assume that you are using a {{dateutil}} timezone, is 
that correct? ({{type(dtype.tz)}} should show it)

At the moment, pyarrow only supports {{pytz}} timezones and the standard 
library's fixed offset timezone. See ARROW-5248 for that.

> [Python] pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert 
> timezone `tzoffset(None, -14400)` to string
> -
>
> Key: ARROW-5540
> URL: https://issues.apache.org/jira/browse/ARROW-5540
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Michał Kujawski
>Priority: Major
>
> *Overview:*
> When trying to save DataFrame to parquet error is thrown while parsing a 
> column with the following properties:
>  
> {code:java}
> dtype: datetime64[ns, tzoffset(None, -14400)]
> dtype.tz: tzoffset(None, -14400)
> {code}
>  
>  
> *Error:* 
> {code:java}
> ValueError: Unable to convert timezone `tzoffset(None, -14400)` to 
> string{code}
>  
> *Error stack:*
> {code:java}
> File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 480, in dataframe_to_arrays
> types)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 209, in construct_metadata
> field_name=sanitized_name)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 153, in get_column_metadata
> string_dtype, extra_metadata = get_extension_dtype_info(column)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 126, in get_extension_dtype_info
> metadata = {'timezone': pa.lib.tzinfo_to_string(dtype.tz)}
> File "pyarrow/types.pxi", line 1149, in pyarrow.lib.tzinfo_to_string
> ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string
> {code}
> *Libraries:*
>  * pandas 0.24.2
>  * pyarrow 0.13.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5248) [Python] support dateutil timezones

2019-06-12 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861969#comment-16861969
 ] 

Joris Van den Bossche commented on ARROW-5248:
--

Another example of dateutil timezone was reported in ARROW-5540.

Reproducible example:

{code:python}
In [32]: import dateutil

In [33]: tz = dateutil.tz.tzoffset(None, -14400)

In [34]: pd.Timestamp("2019-01-01", tz=tz)
Out[34]: Timestamp('2019-01-01 00:00:00-0400', tz='tzoffset(None, -14400)')

In [39]: pa.array(pd.Series([pd.Timestamp("2019-01-01", tz=tz)]))
...
ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string
{code}

> [Python] support dateutil timezones
> ---
>
> Key: ARROW-5248
> URL: https://issues.apache.org/jira/browse/ARROW-5248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> The {{dateutil}} packages also provides a set of timezone objects 
> (https://dateutil.readthedocs.io/en/stable/tz.html) in addition to {{pytz}}. 
> In pyarrow, we only support pytz timezones (and the stdlib datetime.timezone 
> fixed offset):
> {code}
> In [2]: import dateutil.tz
>   
>   
> In [3]: import pyarrow as pa  
>   
>   
> In [5]: pa.timestamp('us', dateutil.tz.gettz('Europe/Brussels'))  
>   
>   
> ...
> ~/miniconda3/envs/dev37/lib/python3.7/site-packages/pyarrow/types.pxi in 
> pyarrow.lib.tzinfo_to_string()
> ValueError: Unable to convert timezone 
> `tzfile('/usr/share/zoneinfo/Europe/Brussels')` to string
> {code}
> But pandas also supports dateutil timezones. As a consequence, when having a 
> pandas DataFrame that uses a dateutil timezone, you get an error when 
> converting to an arrow table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5572) [Python] raise error message when passing invalid filter in parquet reading

2019-06-12 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5572:


 Summary: [Python] raise error message when passing invalid filter 
in parquet reading
 Key: ARROW-5572
 URL: https://issues.apache.org/jira/browse/ARROW-5572
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche


From 
https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset

For example, when specifying a filter on a column that is a normal column and 
not a key in your partitioned folder hierarchy, the filter gets silently 
ignored. It would be nice to get an error message for this. 
Reproducible example:

{code:python}
df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1], 'c': [1, 2, 3, 4]})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, 'test_parquet_row_filters', partition_cols=['a'])
# filter on 'a' (partition column) -> works
pq.read_table('test_parquet_row_filters', filters=[('a', '=', 1)]).to_pandas()
# filter on normal column (in future could do row group filtering) -> silently 
does nothing
pq.read_table('test_parquet_row_filters', filters=[('b', '=', 1)]).to_pandas()
{code}
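A sketch of the kind of validation that could raise instead of silently 
ignoring such a filter (illustrative names only):

{code:python}
def check_filters_against_partitions(filters, partition_keys):
    # raise if a filter references a column that is not part of the
    # partition hierarchy (such filters are currently ignored silently)
    for column, op, value in filters:
        if column not in partition_keys:
            raise ValueError(
                "Filter on {!r} would be ignored: only partition columns "
                "{} can currently be filtered".format(
                    column, sorted(partition_keys)))
{code}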



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5532) [JS] Field Metadata Not Read

2019-06-12 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5532:
-
Labels: Javas  (was: )

> [JS] Field Metadata Not Read
> 
>
> Key: ARROW-5532
> URL: https://issues.apache.org/jira/browse/ARROW-5532
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14, Chrome 74
>Reporter: Trey Hakanson
>Assignee: Paul Taylor
>Priority: Major
>  Labels: Javas
> Fix For: 0.14.0
>
>
> Field metadata is not read when using {{@apache-arrow/ts@0.13.0}}. Example 
> below also uses {{pyarrow==0.13.0}}
> Steps to reproduce:
> Adding metadata:
> {code:title=toarrow.py|borderStyle=solid}
> import pyarrow as pa
> import pandas as pd
> source = "sample.csv"
> output = "sample.arrow"
> df = pd.read_csv(source)
> table = pa.Table.from_pandas(df)
> schema = pa.schema([
>  column.field.add_metadata({"foo": "bar"}))
>  for column
>  in table.columns
> ])
> writer = pa.RecordBatchFileWriter(output, schema)
> writer.write(table)
> writer.close()
> {code}
> Reading field metadata using {{pyarrow}}:
> {code:title=readarrow.py|borderStyle=solid}
> source = "sample.arrow"
> field = "foo"
> reader = pa.RecordBatchFileReader(source)
> reader.schema.field_by_name(field).metadata # Correctly shows `{"foo": "bar"}`
> {code}
> Reading field metadata using {{@apache-arrow/ts}}:
> {code:title=toarrow.ts|borderStyle=solid}
> import { Table, Field, Type } from "@apache-arrow/ts";
> const url = "https://example.com/sample.arrow";;
> const buf = await fetch(url).then(res => res.arrayBuffer());
> const table = Table.from([new Uint8Array(buf)]);
> for (let field of table.schema.fields) {
>  field.metadata; // Incorrectly shows an empty map
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5532) [JS] Field Metadata Not Read

2019-06-12 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5532:
-
Component/s: JavaScript

> [JS] Field Metadata Not Read
> 
>
> Key: ARROW-5532
> URL: https://issues.apache.org/jira/browse/ARROW-5532
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.13.0
> Environment: Mac OSX 10.14, Chrome 74
>Reporter: Trey Hakanson
>Assignee: Paul Taylor
>Priority: Major
>  Labels: Javas
> Fix For: 0.14.0
>
>
> Field metadata is not read when using {{@apache-arrow/ts@0.13.0}}. Example 
> below also uses {{pyarrow==0.13.0}}
> Steps to reproduce:
> Adding metadata:
> {code:title=toarrow.py|borderStyle=solid}
> import pyarrow as pa
> import pandas as pd
> source = "sample.csv"
> output = "sample.arrow"
> df = pd.read_csv(source)
> table = pa.Table.from_pandas(df)
> schema = pa.schema([
>  column.field.add_metadata({"foo": "bar"}))
>  for column
>  in table.columns
> ])
> writer = pa.RecordBatchFileWriter(output, schema)
> writer.write(table)
> writer.close()
> {code}
> Reading field metadata using {{pyarrow}}:
> {code:title=readarrow.py|borderStyle=solid}
> source = "sample.arrow"
> field = "foo"
> reader = pa.RecordBatchFileReader(source)
> reader.schema.field_by_name(field).metadata # Correctly shows `{"foo": "bar"}`
> {code}
> Reading field metadata using {{@apache-arrow/ts}}:
> {code:title=toarrow.ts|borderStyle=solid}
> import { Table, Field, Type } from "@apache-arrow/ts";
> const url = "https://example.com/sample.arrow";;
> const buf = await fetch(url).then(res => res.arrayBuffer());
> const table = Table.from([new Uint8Array(buf)]);
> for (let field of table.schema.fields) {
>  field.metadata; // Incorrectly shows an empty map
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5540) [Python] pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string

2019-06-12 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862079#comment-16862079
 ] 

Joris Van den Bossche commented on ARROW-5540:
--

Thanks for the follow-up. OK, since it is due to dateutil timezones not being 
supported at the moment, I am going to close this issue as a duplicate of 
ARROW-5248.

To prevent this issue, you should be able to make pandas create a pytz timezone 
(or a standard library fixed-offset timezone) instead of a dateutil timezone 
(not sure how the dateutil timezone ended up there in your case).
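For example (a sketch assuming a fixed-offset dateutil timezone as in the 
report), the column can be converted to the standard library's fixed-offset 
timezone, which pyarrow does support:

{code:python}
import datetime
import dateutil.tz
import pandas as pd
import pyarrow as pa

ser = pd.Series(
    [pd.Timestamp("2019-01-01", tz=dateutil.tz.tzoffset(None, -14400))])

# convert to an equivalent stdlib fixed offset (-14400 seconds = -4 hours)
ser_fixed = ser.dt.tz_convert(
    datetime.timezone(datetime.timedelta(seconds=-14400)))

pa.array(ser_fixed)   # no longer raises the tzinfo_to_string ValueError
{code}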

> [Python] pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert 
> timezone `tzoffset(None, -14400)` to string
> -
>
> Key: ARROW-5540
> URL: https://issues.apache.org/jira/browse/ARROW-5540
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Michał Kujawski
>Priority: Major
>
> *Overview:*
> When trying to save DataFrame to parquet error is thrown while parsing a 
> column with the following properties:
>  
> {code:java}
> dtype: datetime64[ns, tzoffset(None, -14400)]
> dtype.tz: tzoffset(None, -14400)
> {code}
>  
>  
> *Error:* 
> {code:java}
> ValueError: Unable to convert timezone `tzoffset(None, -14400)` to 
> string{code}
>  
> *Error stack:*
> {code:java}
> File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 480, in dataframe_to_arrays
> types)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 209, in construct_metadata
> field_name=sanitized_name)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 153, in get_column_metadata
> string_dtype, extra_metadata = get_extension_dtype_info(column)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 126, in get_extension_dtype_info
> metadata = {'timezone': pa.lib.tzinfo_to_string(dtype.tz)}
> File "pyarrow/types.pxi", line 1149, in pyarrow.lib.tzinfo_to_string
> ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string
> {code}
> *Libraries:*
>  * pandas 0.24.2
>  * pyarrow 0.13.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-5540) [Python] pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string

2019-06-12 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-5540.

Resolution: Duplicate

> [Python] pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert 
> timezone `tzoffset(None, -14400)` to string
> -
>
> Key: ARROW-5540
> URL: https://issues.apache.org/jira/browse/ARROW-5540
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Michał Kujawski
>Priority: Major
>
> *Overview:*
> When trying to save DataFrame to parquet error is thrown while parsing a 
> column with the following properties:
>  
> {code:java}
> dtype: datetime64[ns, tzoffset(None, -14400)]
> dtype.tz: tzoffset(None, -14400)
> {code}
>  
>  
> *Error:* 
> {code:java}
> ValueError: Unable to convert timezone `tzoffset(None, -14400)` to 
> string{code}
>  
> *Error stack:*
> {code:java}
> File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 480, in dataframe_to_arrays
> types)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 209, in construct_metadata
> field_name=sanitized_name)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 153, in get_column_metadata
> string_dtype, extra_metadata = get_extension_dtype_info(column)
> File 
> "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 126, in get_extension_dtype_info
> metadata = {'timezone': pa.lib.tzinfo_to_string(dtype.tz)}
> File "pyarrow/types.pxi", line 1149, in pyarrow.lib.tzinfo_to_string
> ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string
> {code}
> *Libraries:*
>  * pandas 0.24.2
>  * pyarrow 0.13.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2298) [Python] Add option to not consider NaN to be null when converting to an integer Arrow type

2019-06-12 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862275#comment-16862275
 ] 

Joris Van den Bossche commented on ARROW-2298:
--

I am not sure I fully understand your question. I am passing a numpy array to 
{{pa.array}}, and creating a numpy array with np.nan or None ({{np.array([1, 2, 
None], dtype=float)}} or {{np.array([1, 2, np.nan], dtype=float)}}) gives the 
same float64 numpy array. 

Also, if you have a pandas object like {{pd.Series([1, 2, None])}}, this will 
actually be stored as a float array with np.nan.
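A quick check showing that both spellings produce the same float64 array 
(illustrative):

{code:python}
import numpy as np

arr1 = np.array([1, 2, None], dtype=float)
arr2 = np.array([1, 2, np.nan], dtype=float)

arr1.dtype == arr2.dtype                  # True, both are float64
np.allclose(arr1, arr2, equal_nan=True)   # True, the None became NaN
{code}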

> [Python] Add option to not consider NaN to be null when converting to an 
> integer Arrow type
> ---
>
> Key: ARROW-2298
> URL: https://issues.apache.org/jira/browse/ARROW-2298
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Follow-on work to ARROW-2135



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3686) [Python] Support for masked arrays in to/from numpy

2019-06-12 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-3686:


Assignee: Joris Van den Bossche

> [Python] Support for masked arrays in to/from numpy
> ---
>
> Key: ARROW-3686
> URL: https://issues.apache.org/jira/browse/ARROW-3686
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Maarten Breddels
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> Again, in this PR for vaex: 
> [https://github.com/maartenbreddels/vaex/pull/116] I support masked arrays, 
> it would be nice if this goes into pyarrow. If this approach looks good I 
> could do a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-06-13 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863527#comment-16863527
 ] 

Joris Van den Bossche commented on ARROW-5220:
--

I can look into taking the index columns into account for matching with the 
schema. 

One complication will be the RangeIndex serialization (as metadata instead of 
as a column). Related discussion in ARROW-5427: an option could be to have 
{{preserve_index=True}} force a RangeIndex to always be serialized as actual 
data instead of the default metadata. That would make the expected schema 
consistent independent of whether your dataframe has a RangeIndex or an 
Int64Index, and thus easier to match against a specified schema.
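To illustrate the inconsistency (a sketch of the current behaviour; the exact 
metadata handling depends on the pyarrow version):

{code:python}
import pandas as pd
import pyarrow as pa

df1 = pd.DataFrame({'a': [1, 2, 3]})                   # default RangeIndex
df2 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])  # materialized Int64Index

# df1's RangeIndex is only stored as metadata -> schema has just column 'a',
# while df2's index becomes an extra '__index_level_0__' column
pa.Table.from_pandas(df1).schema
pa.Table.from_pandas(df2).schema
{code}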

> [Python] index / unknown columns in specified schema in Table.from_pandas
> -
>
> Key: ARROW-5220
> URL: https://issues.apache.org/jira/browse/ARROW-5220
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> The {{Table.from_pandas}} method allows to specify a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But, if you also want to specify the type of the index, you get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>    ('a', pa.int64()),
>    ('b', pa.float64()),
>   ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
> from the schema in the dataframe, and thus does not find column 'index').
> This also has the consequence that re-using the schema does not work: 
> {{table1 = pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
> schema=table1.schema)}}
> Extra note: also unknown columns in general give this error (column specified 
> in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (eg noticed this from 
> the code in example in ARROW-3861). So before, unknown columns in the 
> specified schema were ignored, while now they raise an error. Was this a 
> conscious change?  
> So before also specifying the index in the schema "worked" in the sense that 
> it didn't raise an error, but it was also ignored, so didn't actually do what 
> you would expect)
> Questions:
> - I think that we should support specifying the index in the passed 
> {{schema}} ? So that the example above works (although this might be 
> complicated with RangeIndex that is not serialized any more)
> - But what to do in general with additional columns in the schema that are 
> not in the DataFrame? Are we fine with keep raising an error as it is now 
> (the error message could be improved then)? Or do we again want to ignore 
> them? (or, it could actually also add them as all nulls to the table)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5603) [Python] register pytest markers to avoid warnings

2019-06-14 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5603:


 Summary: [Python] register pytest markers to avoid warnings
 Key: ARROW-5603
 URL: https://issues.apache.org/jira/browse/ARROW-5603
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.14.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5603) [Python] register pytest markers to avoid warnings

2019-06-14 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5603:
-
Description: 
Currently the python test suite gives warnings like:

{code}
/home/joris/miniconda3/envs/arrow-dev/lib/python3.7/site-packages/_pytest/mark/structures.py:337
  
/home/joris/miniconda3/envs/arrow-dev/lib/python3.7/site-packages/_pytest/mark/structures.py:337:
 PytestUnknownMarkWarning: Unknown pytest.mark.pandas - is this a typo?  You 
can register custom marks to avoid this warning - for details, see 
https://docs.pytest.org/en/latest/mark.html
PytestUnknownMarkWarning,
{code}
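For example, the custom marks could be registered via a {{conftest.py}} hook (a 
sketch; the actual fix may register them in a config file instead, and the full 
list of marks differs):

{code:python}
# conftest.py
def pytest_configure(config):
    # register the custom marks used by the test suite so pytest stops warning
    config.addinivalue_line("markers", "pandas: tests that require pandas")
{code}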

> [Python] register pytest markers to avoid warnings
> ---
>
> Key: ARROW-5603
> URL: https://issues.apache.org/jira/browse/ARROW-5603
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the python test suite gives warnings like:
> {code}
> /home/joris/miniconda3/envs/arrow-dev/lib/python3.7/site-packages/_pytest/mark/structures.py:337
>   
> /home/joris/miniconda3/envs/arrow-dev/lib/python3.7/site-packages/_pytest/mark/structures.py:337:
>  PytestUnknownMarkWarning: Unknown pytest.mark.pandas - is this a typo?  You 
> can register custom marks to avoid this warning - for details, see 
> https://docs.pytest.org/en/latest/mark.html
> PytestUnknownMarkWarning,
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5606) [Python] pandas.RangeIndex._start/_stop/_step are deprecated

2019-06-14 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5606:


 Summary: [Python] pandas.RangeIndex._start/_stop/_step are 
deprecated
 Key: ARROW-5606
 URL: https://issues.apache.org/jira/browse/ARROW-5606
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.14.0


Public {{start/stop/step}} attributes were added to RangeIndex, and the 
private {{_start/_stop/_step}} attributes are deprecated. See 
https://github.com/pandas-dev/pandas/pull/26581
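A small compatibility sketch for code that needs to work across pandas versions 
(assuming the public attributes are available once that PR is released):

{code:python}
def range_index_attrs(index):
    # newer pandas exposes public start/stop/step; older versions only
    # have the private (now deprecated) _start/_stop/_step attributes
    if hasattr(index, "start"):
        return index.start, index.stop, index.step
    return index._start, index._stop, index._step
{code}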



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5618) [C++] [Parquet] Using deprecated Int96 storage for timestamps triggers integer overflow in some cases

2019-06-17 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5618:
-
Labels: parquet  (was: )

> [C++] [Parquet] Using deprecated Int96 storage for timestamps triggers 
> integer overflow in some cases
> -
>
> Key: ARROW-5618
> URL: https://issues.apache.org/jira/browse/ARROW-5618
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: TP Boudreau
>Assignee: TP Boudreau
>Priority: Minor
>  Labels: parquet
>
> When storing Arrow timestamps in Parquet files using the Int96 storage 
> format, certain combinations of array lengths and validity bitmasks cause an 
> integer overflow error on read.  It's not immediately clear whether the 
> Arrow/Parquet writer is storing zeroes when it should be storing positive 
> values or the reader is attempting to calculate a nanoseconds value 
> inappropriately from zeroed inputs (perhaps missing the null bit flag).  Also 
> not immediately clear why only certain length columns seem to be affected.
> Probably the quickest way to reproduce this undefined behavior is to alter 
> the existing unit test UseDeprecatedInt96 (in file 
> .../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling 
> its column lengths (repeating the same values), followed by 'make unittest' 
> using clang-7 with sanitizers enabled.  (Here's a patch applicable to current 
> master that changes the test as described: [1]; I used the following cmake 
> command to build my environment: [2].)  You should get a log something like 
> [3].  If requested, I'll see if I can put together a stand-alone minimal test 
> case that induces the behavior.
> The quick-hack at [4] will prevent integer overflows, but this is only 
> included to confirm the proximate cause of the bug: the Julian days field of 
> the Int96 appears to be zero, when a strictly positive number is expected.
> I've assigned the issue to myself and I'll start looking into the root cause 
> of this.
> [1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e
> [2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9
> [3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d
> [4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5208) [Python] Inconsistent resulting type during casting in pa.array() when mask is present

2019-06-17 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865558#comment-16865558
 ] 

Joris Van den Bossche commented on ARROW-5208:
--

[~ArtemK] still interested to take a look at this?

> [Python] Inconsistent resulting type during casting in pa.array() when mask 
> is present
> --
>
> Key: ARROW-5208
> URL: https://issues.apache.org/jira/browse/ARROW-5208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
> Fix For: 0.14.0
>
>
> I would expect Int64Array type in all cases below :
> {code:java}
> >>> pa.array([4, None, 4, None], mask=np.array([False, True, False, True]))   
> >>>                                                                           
> >>>      
>  [4, null, 4,  null ]
> >>> pa.array([4, None, 4, 'rer'], mask=np.array([False, True, False, True]))  
> >>>                                                                           
> >>>         
>  [4, null, 4,  null ]
> >>> pa.array([4, None, 4, 3.], mask=np.array([False, True, False, True]))     
> >>>                                                                           
> >>>           [   4,   null,   
> >>> 4,   null ]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-06-17 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865966#comment-16865966
 ] 

Joris Van den Bossche commented on ARROW-5220:
--

[~wesmckinn] what do you think of the idea that `preserve_index=True` forces 
RangeIndex to be an actual column in the Table? (to have a consistent match 
with a specified schema)

> [Python] index / unknown columns in specified schema in Table.from_pandas
> -
>
> Key: ARROW-5220
> URL: https://issues.apache.org/jira/browse/ARROW-5220
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> The {{Table.from_pandas}} method allows to specify a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But, if you also want to specify the type of the index, you get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>    ('a', pa.int64()),
>    ('b', pa.float64()),
>   ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
> from the schema in the dataframe, and thus does not find column 'index').
> This also has the consequence that re-using the schema does not work: 
> {{table1 = pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
> schema=table1.schema)}}
> Extra note: also unknown columns in general give this error (column specified 
> in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (eg noticed this from 
> the code in example in ARROW-3861). So before, unknown columns in the 
> specified schema were ignored, while now they raise an error. Was this a 
> conscious change?  
> So before also specifying the index in the schema "worked" in the sense that 
> it didn't raise an error, but it was also ignored, so didn't actually do what 
> you would expect)
> Questions:
> - I think that we should support specifying the index in the passed 
> {{schema}} ? So that the example above works (although this might be 
> complicated with RangeIndex that is not serialized any more)
> - But what to do in general with additional columns in the schema that are 
> not in the DataFrame? Are we fine with keep raising an error as it is now 
> (the error message could be improved then)? Or do we again want to ignore 
> them? (or, it could actually also add them as all nulls to the table)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5309) [Python] Add clarifications to Python "append" methods that return new objects

2019-06-18 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-5309:


Assignee: Joris Van den Bossche

> [Python] Add clarifications to Python "append" methods that return new objects
> --
>
> Key: ARROW-5309
> URL: https://issues.apache.org/jira/browse/ARROW-5309
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> The current docstrings do say that an object is returned but it is not clear 
> in all cases that it is a new object and the original object is left 
> unmodified
> see example thread
> https://github.com/apache/arrow/issues/4296



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-4076) [Python] schema validation and filters

2019-06-18 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-4076:


Assignee: Joris Van den Bossche

> [Python] schema validation and filters
> --
>
> Key: ARROW-4076
> URL: https://issues.apache.org/jira/browse/ARROW-4076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: George Sakkis
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: datasets, easyfix, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently [schema 
> validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
>  of {{ParquetDataset}} takes place before filtering. This may raise a 
> {{ValueError}} if the schema is different in some dataset pieces, even if 
> these pieces would be subsequently filtered out. I think validation should 
> happen after filtering to prevent such spurious errors:
> {noformat}
> --- a/pyarrow/parquet.py  
> +++ b/pyarrow/parquet.py  
> @@ -878,13 +878,13 @@
>  if split_row_groups:
>  raise NotImplementedError("split_row_groups not yet implemented")
>  
> -if validate_schema:
> -self.validate_schemas()
> -
>  if filters is not None:
>  filters = _check_filters(filters)
>  self._filter(filters)
>  
> +if validate_schema:
> +self.validate_schemas()
> +
>  def validate_schemas(self):
>  open_file = self._get_open_file_func()
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-4847) [Python] Add pyarrow.table factory function that dispatches to various ctors based on type of input

2019-06-18 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-4847:


Assignee: Joris Van den Bossche  (was: Wes McKinney)

> [Python] Add pyarrow.table factory function that dispatches to various ctors 
> based on type of input
> ---
>
> Key: ARROW-4847
> URL: https://issues.apache.org/jira/browse/ARROW-4847
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> For example, in {{pyarrow.table(df)}} if {{df}} is a {{pandas.DataFrame}}, 
> then table will dispatch to {{pa.Table.from_pandas}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2572) [Python] Add factory function to create a Table from Columns and Schema.

2019-06-18 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866393#comment-16866393
 ] 

Joris Van den Bossche commented on ARROW-2572:
--

The {{Table.from_arrays}} docstring nowadays clearly mentions that it accepts 
both arrays and columns: 
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_arrays
 (and also the schema is now mentioned in the docstring).

Of course, when looking for a method to create a Table from columns, I agree a 
{{from_columns}} is certainly more discoverable. But it might also be API 
clutter ...

Would we want to make it an exact alias? Or actually a separate function that 
is more strict (not allowing a list of arrays)? Because for a list of columns, 
you would not necessarily need to pass the column names (or schema), as is 
required in {{from_arrays}}.
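For illustration, a small sketch of the difference (the {{from_columns}} call 
below is hypothetical):

{code:python}
import pyarrow as pa

arr = pa.array([1, 2, 3])

# Today: names (or a schema) are required when passing plain arrays
table = pa.Table.from_arrays([arr], names=['a'])

# Hypothetical stricter alternative: columns already carry their names,
# so a from_columns factory would not need a names/schema argument
# table2 = pa.Table.from_columns(table.columns)
{code}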

> [Python] Add factory function to create a Table from Columns and Schema.
> 
>
> Key: ARROW-2572
> URL: https://issues.apache.org/jira/browse/ARROW-2572
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Thomas Buhrmann
>Priority: Minor
>  Labels: beginner
> Fix For: 0.14.0
>
>
> At the moment it seems to be impossible in Python to add custom metadata to a 
> Table or Column. The closest I've come is to create a list of new Fields (by 
> "appending" metadata to existing Fields), and then creating a new Schema from 
> these Fields using the Schema factory function. But I can't see how to create 
> a new table from the existing Columns and my new Schema, which I understand 
> would be the way to do it in C++?
> Essentially, wrappers for the Table's Make(...) functions seem to be missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5241) [Python] Add option to disable writing statistics to parquet file

2019-06-18 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-5241:


Assignee: Joris Van den Bossche

> [Python] Add option to disable writing statistics to parquet file
> -
>
> Key: ARROW-5241
> URL: https://issues.apache.org/jira/browse/ARROW-5241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Deepak Majeti
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> C++  Parquet API exposes an option to disable writing statistics when writing 
> a Parquet file.
> It will be useful to expose this API in the Python Arrow API as well.
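> The Python binding could then look roughly like this (the {{write_statistics}} 
> parameter name is an assumption until the option is actually exposed):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> table = pa.Table.from_pydict({'a': [1, 2, 3]})
> pq.write_table(table, 'data.parquet', write_statistics=False)
> {code}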



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5654) [C++] ChunkedArray should validate the types of the arrays

2019-06-19 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5654:


 Summary: [C++] ChunkedArray should validate the types of the arrays
 Key: ARROW-5654
 URL: https://issues.apache.org/jira/browse/ARROW-5654
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Example from Python, showing that you can currently create a ChunkedArray with 
incompatible types:

{code:python}
In [8]: a1 = pa.array([1, 2])

In [9]: a2 = pa.array(['a', 'b'])

In [10]: pa.chunked_array([a1, a2])
Out[10]:

[
  [
1,
2
  ],
  [
"a",
"b"
  ]
]
{code}

So a {{ChunkedArray::Validate}} method can be implemented (and it should 
probably be called by default upon creation?).
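In the meantime, the same check can be sketched on the Python side (a minimal 
sketch, not the proposed C++ implementation):

{code:python}
import pyarrow as pa

def checked_chunked_array(chunks):
    # Reject chunks whose type doesn't match the first chunk's type
    first_type = chunks[0].type
    for chunk in chunks[1:]:
        if chunk.type != first_type:
            raise ValueError(
                "chunk type {} does not match {}".format(chunk.type, first_type))
    return pa.chunked_array(chunks)

checked_chunked_array([pa.array([1, 2]), pa.array(['a', 'b'])])  # raises
{code}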



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5655) [Python] Table.from_pydict/from_arrays not using types in specified schema correctly

2019-06-19 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5655:


 Summary: [Python] Table.from_pydict/from_arrays not using types in 
specified schema correctly 
 Key: ARROW-5655
 URL: https://issues.apache.org/jira/browse/ARROW-5655
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Example with {{from_pydict}} (from 
https://github.com/apache/arrow/pull/4601#issuecomment-503676534):

{code:python}
In [15]: table = pa.Table.from_pydict(
...: {'a': [1, 2, 3], 'b': [3, 4, 5]},
...: schema=pa.schema([('a', pa.int64()), ('c', pa.int32())]))

In [16]: table
Out[16]: 
pyarrow.Table
a: int64
c: int32

In [17]: table.to_pandas()
Out[17]: 
   a  c
0  1  3
1  2  0
2  3  4
{code}

Note that the specified schema has 1) different column names and 2) has a 
non-default type (int32 vs int64) which leads to corrupted values.

This is partly due to {{Table.from_pydict}} not using the type information in 
the schema to convert the dictionary items to pyarrow arrays. But then it is 
also {{Table.from_arrays}} that is not correctly casting the arrays to another 
dtype if the schema specifies as such.

Additional question for {{Table.from_pydict}} is whether it actually should override 
the 'b' key from the dictionary as column 'c' as defined in the schema (this 
behaviour depends on the order of the dictionary, which is not guaranteed below 
python 3.6).
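As a workaround sketch until this is fixed: build the arrays with the intended 
types up front, so that no (missing) cast is needed:

{code:python}
import pyarrow as pa

schema = pa.schema([('a', pa.int64()), ('c', pa.int32())])
arrays = [pa.array([1, 2, 3], type=pa.int64()),
          pa.array([3, 4, 5], type=pa.int32())]
table = pa.Table.from_arrays(arrays, schema=schema)
table.to_pandas()  # values come through uncorrupted
{code}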




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5630) [Python] Table of nested arrays doesn't round trip

2019-06-19 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867981#comment-16867981
 ] 

Joris Van den Bossche commented on ARROW-5630:
--

It is somehow related to the length of the array. I see the same error as above 
on master, but when using 10x less data it works correctly.

> [Python] Table of nested arrays doesn't round trip
> --
>
> Key: ARROW-5630
> URL: https://issues.apache.org/jira/browse/ARROW-5630
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow 0.13, Windows 10
>Reporter: Philip Felton
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> def make_table(num_rows):
> typ = pa.list_(pa.field("item", pa.float32(), False))
> return pa.Table.from_arrays([
> pa.array([[0] * (i%10) for i in range(0, num_rows)], type=typ),
> pa.array([[0] * ((i+5)%10) for i in range(0, num_rows)], type=typ)
> ], ['a', 'b'])
> pq.write_table(make_table(100), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pq.read_table('full.parquet')
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, 
> columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>1150 return fs.read_parquet(path, columns=columns,
>1151use_threads=use_threads, 
> metadata=metadata,
> -> 1152
> use_pandas_metadata=use_pandas_metadata)
>1153 
>1154 pf = ParquetFile(source, metadata=metadata)
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, 
> path, columns, metadata, schema, use_threads, use_pandas_metadata)
> 179  filesystem=self)
> 180 return dataset.read(columns=columns, use_threads=use_threads,
> --> 181 use_pandas_metadata=use_pandas_metadata)
> 182 
> 183 def open(self, path, mode='rb'):
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
>1012 table = piece.read(columns=columns, 
> use_threads=use_threads,
>1013partitions=self.partitions,
> -> 1014
> use_pandas_metadata=use_pandas_metadata)
>1015 tables.append(table)
>1016 
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, partitions, open_file_func, file, use_pandas_metadata)
> 562 table = reader.read_row_group(self.row_group, **options)
> 563 else:
> --> 564 table = reader.read(**options)
> 565 
> 566 if len(self.partition_keys) > 0:
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
> 212 columns, use_pandas_metadata=use_pandas_metadata)
> 213 return self.reader.read_all(column_indices=column_indices,
> --> 214 use_threads=use_threads)
> 215 
> 216 def scan_contents(self, columns=None, batch_size=65536):
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5630) [Python] Table of nested arrays doesn't round trip

2019-06-19 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867988#comment-16867988
 ] 

Joris Van den Bossche commented on ARROW-5630:
--

Yes, with the default of nullable=True, I don't see the error.

> [Python] Table of nested arrays doesn't round trip
> --
>
> Key: ARROW-5630
> URL: https://issues.apache.org/jira/browse/ARROW-5630
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow 0.13, Windows 10
>Reporter: Philip Felton
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> def make_table(num_rows):
> typ = pa.list_(pa.field("item", pa.float32(), False))
> return pa.Table.from_arrays([
> pa.array([[0] * (i%10) for i in range(0, num_rows)], type=typ),
> pa.array([[0] * ((i+5)%10) for i in range(0, num_rows)], type=typ)
> ], ['a', 'b'])
> pq.write_table(make_table(100), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pq.read_table('full.parquet')
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, 
> columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>1150 return fs.read_parquet(path, columns=columns,
>1151use_threads=use_threads, 
> metadata=metadata,
> -> 1152
> use_pandas_metadata=use_pandas_metadata)
>1153 
>1154 pf = ParquetFile(source, metadata=metadata)
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, 
> path, columns, metadata, schema, use_threads, use_pandas_metadata)
> 179  filesystem=self)
> 180 return dataset.read(columns=columns, use_threads=use_threads,
> --> 181 use_pandas_metadata=use_pandas_metadata)
> 182 
> 183 def open(self, path, mode='rb'):
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
>1012 table = piece.read(columns=columns, 
> use_threads=use_threads,
>1013partitions=self.partitions,
> -> 1014
> use_pandas_metadata=use_pandas_metadata)
>1015 tables.append(table)
>1016 
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, partitions, open_file_func, file, use_pandas_metadata)
> 562 table = reader.read_row_group(self.row_group, **options)
> 563 else:
> --> 564 table = reader.read(**options)
> 565 
> 566 if len(self.partition_keys) > 0:
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
> 212 columns, use_pandas_metadata=use_pandas_metadata)
> 213 return self.reader.read_all(column_indices=column_indices,
> --> 214 use_threads=use_threads)
> 215 
> 216 def scan_contents(self, columns=None, batch_size=65536):
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5630) [Python] Table of nested arrays doesn't round trip

2019-06-19 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867997#comment-16867997
 ] 

Joris Van den Bossche commented on ARROW-5630:
--

Sure, I didn't yet look into it (and will certainly not tonight). I only ran 
the code snippet to see if I could reproduce it yesterday, but forgot to 
comment about it.

> [Python] Table of nested arrays doesn't round trip
> --
>
> Key: ARROW-5630
> URL: https://issues.apache.org/jira/browse/ARROW-5630
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow 0.13, Windows 10
>Reporter: Philip Felton
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> def make_table(num_rows):
> typ = pa.list_(pa.field("item", pa.float32(), False))
> return pa.Table.from_arrays([
> pa.array([[0] * (i%10) for i in range(0, num_rows)], type=typ),
> pa.array([[0] * ((i+5)%10) for i in range(0, num_rows)], type=typ)
> ], ['a', 'b'])
> pq.write_table(make_table(100), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pq.read_table('full.parquet')
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, 
> columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>1150 return fs.read_parquet(path, columns=columns,
>1151use_threads=use_threads, 
> metadata=metadata,
> -> 1152
> use_pandas_metadata=use_pandas_metadata)
>1153 
>1154 pf = ParquetFile(source, metadata=metadata)
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, 
> path, columns, metadata, schema, use_threads, use_pandas_metadata)
> 179  filesystem=self)
> 180 return dataset.read(columns=columns, use_threads=use_threads,
> --> 181 use_pandas_metadata=use_pandas_metadata)
> 182 
> 183 def open(self, path, mode='rb'):
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
>1012 table = piece.read(columns=columns, 
> use_threads=use_threads,
>1013partitions=self.partitions,
> -> 1014
> use_pandas_metadata=use_pandas_metadata)
>1015 tables.append(table)
>1016 
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, partitions, open_file_func, file, use_pandas_metadata)
> 562 table = reader.read_row_group(self.row_group, **options)
> 563 else:
> --> 564 table = reader.read(**options)
> 565 
> 566 if len(self.partition_keys) > 0:
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
> 212 columns, use_pandas_metadata=use_pandas_metadata)
> 213 return self.reader.read_all(column_indices=column_indices,
> --> 214 use_threads=use_threads)
> 215 
> 216 def scan_contents(self, columns=None, batch_size=65536):
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-5630) [Python] Table of nested arrays doesn't round trip

2019-06-19 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867997#comment-16867997
 ] 

Joris Van den Bossche edited comment on ARROW-5630 at 6/19/19 8:48 PM:
---

Sure, I didn't yet look into it (and will certainly not tonight). I only ran 
the code snippet to see if I could reproduce on master yesterday, but forgot to 
comment about it.


was (Author: jorisvandenbossche):
Sure, I didn't yet look into it (and will certainly not tonight). I only ran 
the code snippet to see if I could reproduce it yesterday, but forgot to 
comment about it.

> [Python] Table of nested arrays doesn't round trip
> --
>
> Key: ARROW-5630
> URL: https://issues.apache.org/jira/browse/ARROW-5630
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: pyarrow 0.13, Windows 10
>Reporter: Philip Felton
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> This is pyarrow 0.13 on Windows.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> def make_table(num_rows):
> typ = pa.list_(pa.field("item", pa.float32(), False))
> return pa.Table.from_arrays([
> pa.array([[0] * (i%10) for i in range(0, num_rows)], type=typ),
> pa.array([[0] * ((i+5)%10) for i in range(0, num_rows)], type=typ)
> ], ['a', 'b'])
> pq.write_table(make_table(100), 'test.parquet')
> pq.read_table('test.parquet')
> {code}
> The last line throws the following exception:
> {noformat}
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pq.read_table('full.parquet')
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, 
> columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
>1150 return fs.read_parquet(path, columns=columns,
>1151use_threads=use_threads, 
> metadata=metadata,
> -> 1152
> use_pandas_metadata=use_pandas_metadata)
>1153 
>1154 pf = ParquetFile(source, metadata=metadata)
> ~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, 
> path, columns, metadata, schema, use_threads, use_pandas_metadata)
> 179  filesystem=self)
> 180 return dataset.read(columns=columns, use_threads=use_threads,
> --> 181 use_pandas_metadata=use_pandas_metadata)
> 182 
> 183 def open(self, path, mode='rb'):
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
>1012 table = piece.read(columns=columns, 
> use_threads=use_threads,
>1013partitions=self.partitions,
> -> 1014
> use_pandas_metadata=use_pandas_metadata)
>1015 tables.append(table)
>1016 
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, partitions, open_file_func, file, use_pandas_metadata)
> 562 table = reader.read_row_group(self.row_group, **options)
> 563 else:
> --> 564 table = reader.read(**options)
> 565 
> 566 if len(self.partition_keys) > 0:
> ~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
> use_threads, use_pandas_metadata)
> 212 columns, use_pandas_metadata=use_pandas_metadata)
> 213 return self.reader.read_all(column_indices=column_indices,
> --> 214 use_threads=use_threads)
> 215 
> 216 def scan_contents(self, columns=None, batch_size=65536):
> ~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5665) ArrowInvalid on converting Pandas Series with dtype float64

2019-06-20 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868572#comment-16868572
 ] 

Joris Van den Bossche commented on ARROW-5665:
--

[~tnesztler] Can you try to provide a reproducible example?

Based on the error message, it seems you have a column in your DataFrame that 
has Series objects as values in the rows. That's not supported by pyarrow. 
If that is intentional, and you want to save them as a nested List type, then 
you need to convert the column of Series objects to a column of arrays or lists.
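For example, something along these lines (a sketch, assuming the offending 
column is the {{fact_value}} one from the traceback):

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'fact_value': [pd.Series([70.67, 73.0, 0.0])]})

# Convert the Series-valued cells into plain lists first
df['fact_value'] = df['fact_value'].apply(list)

table = pa.Table.from_pandas(df)  # now converts to a list<double> column
{code}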

> ArrowInvalid on converting Pandas Series with dtype float64
> ---
>
> Key: ARROW-5665
> URL: https://issues.apache.org/jira/browse/ARROW-5665
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thibaud Nesztler
>Priority: Minor
>
> {code:java}
> ('Could not convert 0 70.67\n0 73.00\n0 0.00\nName: fact_value, 
> dtype: float64 with type Series: did not recognize Python value type when 
> inferring an Arrow data type', 'Conversion failed for column fact_value with 
> type float64'){code}
> We are experiencing a lot of random errors (will run the same code and not 
> get the error at all) when converting a Pandas DataFrame to parquet files using 
> pyarrow.
> We use this line of code for the conversion:
> {code:java}
> dataframe.to_parquet(filePath, compression="snappy", index=False){code}
> Note: `filePath` is an AWS S3 URI.
> {code:java}
> ArrowInvalid: ('Could not convert 0 70.67\n0 73.00\n0 0.00\nName: 
> fact_value, dtype: float64 with type Series: did not recognize Python value 
> type when inferring an Arrow data type', 'Conversion failed for column 
> fact_value with type float64')
>  File "store_manager.py", line 25, in _write_files_and_partitions
>  dataframe.to_parquet(filePath, compression="snappy", index=False)
>  File "pandas/core/frame.py", line 2203, in to_parquet
>  partition_cols=partition_cols, **kwargs)
>  File "pandas/io/parquet.py", line 252, in to_parquet
>  partition_cols=partition_cols, **kwargs)
>  File "pandas/io/parquet.py", line 113, in write
>  table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
>  File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
>  names, arrays, metadata = dataframe_to_arrays(
>  File "pyarrow/pandas_compat.py", line 474, in dataframe_to_arrays
>  convert_types))
>  File "concurrent/futures/_base.py", line 586, in result_iterator
>  yield fs.pop().result()
>  File "concurrent/futures/_base.py", line 425, in result
>  return self.__get_result()
>  File "concurrent/futures/_base.py", line 384, in __get_result
>  raise self._exception
>  File "concurrent/futures/thread.py", line 57, in run
>  result = self.fn(*self.args, **self.kwargs)
>  File "pyarrow/pandas_compat.py", line 463, in convert_column
>  raise e
>  File "pyarrow/pandas_compat.py", line 457, in convert_column
>  return pa.array(col, type=ty, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>  return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
>  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>  check_status(ConvertPySequence(sequence, mask, options, &out))
>  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
>  raise ArrowInvalid(message){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5665) [Python] ArrowInvalid on converting Pandas Series with dtype float64

2019-06-20 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5665:
-
Summary: [Python] ArrowInvalid on converting Pandas Series with dtype 
float64  (was: ArrowInvalid on converting Pandas Series with dtype float64)

> [Python] ArrowInvalid on converting Pandas Series with dtype float64
> 
>
> Key: ARROW-5665
> URL: https://issues.apache.org/jira/browse/ARROW-5665
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thibaud Nesztler
>Priority: Minor
>
> {code:java}
> ('Could not convert 0 70.67\n0 73.00\n0 0.00\nName: fact_value, 
> dtype: float64 with type Series: did not recognize Python value type when 
> inferring an Arrow data type', 'Conversion failed for column fact_value with 
> type float64'){code}
> We are experiencing a lot of random errors (will run the same code and not 
> get the error at all) when converting a Pandas DataFrame to parquet files using 
> pyarrow.
> We use this line of code for the conversion:
> {code:java}
> dataframe.to_parquet(filePath, compression="snappy", index=False){code}
> Note: `filePath` is an AWS S3 URI.
> {code:java}
> ArrowInvalid: ('Could not convert 0 70.67\n0 73.00\n0 0.00\nName: 
> fact_value, dtype: float64 with type Series: did not recognize Python value 
> type when inferring an Arrow data type', 'Conversion failed for column 
> fact_value with type float64')
>  File "store_manager.py", line 25, in _write_files_and_partitions
>  dataframe.to_parquet(filePath, compression="snappy", index=False)
>  File "pandas/core/frame.py", line 2203, in to_parquet
>  partition_cols=partition_cols, **kwargs)
>  File "pandas/io/parquet.py", line 252, in to_parquet
>  partition_cols=partition_cols, **kwargs)
>  File "pandas/io/parquet.py", line 113, in write
>  table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
>  File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
>  names, arrays, metadata = dataframe_to_arrays(
>  File "pyarrow/pandas_compat.py", line 474, in dataframe_to_arrays
>  convert_types))
>  File "concurrent/futures/_base.py", line 586, in result_iterator
>  yield fs.pop().result()
>  File "concurrent/futures/_base.py", line 425, in result
>  return self.__get_result()
>  File "concurrent/futures/_base.py", line 384, in __get_result
>  raise self._exception
>  File "concurrent/futures/thread.py", line 57, in run
>  result = self.fn(*self.args, **self.kwargs)
>  File "pyarrow/pandas_compat.py", line 463, in convert_column
>  raise e
>  File "pyarrow/pandas_compat.py", line 457, in convert_column
>  return pa.array(col, type=ty, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>  return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
>  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>  check_status(ConvertPySequence(sequence, mask, options, &out))
>  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
>  raise ArrowInvalid(message){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2019-06-20 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868583#comment-16868583
 ] 

Joris Van den Bossche commented on ARROW-5666:
--

Thanks for the report!

The problem is that we try to convert the keys to integer, and if that fails 
just preserve them as strings. 
That is done here 
https://github.com/apache/arrow/blob/961927af56b83d0dbca91132c3f07aa06d69fc63/python/pyarrow/parquet.py#L659-L663

{code}
# Only integer and string partition types are supported right now
try:
integer_keys = [int(x) for x in self.keys]
dictionary = lib.array(integer_keys)
except ValueError:
dictionary = lib.array(self.keys)
{code}

and apparently, Python will convert a string with an underscore to an integer 
...

{code}
In [3]: int("2019_1")
Out[3]: 20191
{code}

I think this is because in recent Python versions underscores are allowed in 
integer literals (eg to separate thousands). 
We could special case this and first check if there is an underscore in the 
string before trying to convert to integers, but that's a bit ugly.
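Something like this is what I mean (a sketch of the special-casing, not 
necessarily how it should end up in {{parquet.py}}):

{code:python}
def parse_partition_keys(keys):
    # Keep keys containing an underscore as strings, since int() would
    # silently accept them as integer literals with digit separators
    try:
        if any('_' in key for key in keys):
            raise ValueError
        return [int(key) for key in keys]
    except ValueError:
        return list(keys)

parse_partition_keys(["2019_1", "2019_2"])  # -> ['2019_1', '2019_2']
parse_partition_keys(["2019", "2020"])      # -> [2019, 2020]
{code}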

> [Python] Underscores in partition (string) values are dropped when reading 
> dataset
> --
>
> Key: ARROW-5666
> URL: https://issues.apache.org/jira/browse/ARROW-5666
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Julian de Ruiter
>Priority: Major
>
> When reading a partitioned dataset, in which the partition column contains 
> string values with underscores, pyarrow seems to be ignoring the underscores 
> in the resulting values.
> For example if I write and then read a dataset as follows:
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> table2 = pq.ParquetDataset('test').read()
> {code}
> The resulting 'year_week' column in table 2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
> <Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
> [
>   -- dictionary:
>     [
>   20192,
>   20193
>     ]
>   -- indices:
>     [
>   0
>     ],
>   -- dictionary:
>     [
>   20192,
>   20193
>     ]
>   -- indices:
>     [
>   1
>     ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2019-06-20 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5666:
-
Labels: parquet  (was: )

> [Python] Underscores in partition (string) values are dropped when reading 
> dataset
> --
>
> Key: ARROW-5666
> URL: https://issues.apache.org/jira/browse/ARROW-5666
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Julian de Ruiter
>Priority: Major
>  Labels: parquet
>
> When reading a partitioned dataset, in which the partition column contains 
> string values with underscores, pyarrow seems to be ignoring the underscores 
> in the resulting values.
> For example if I write and then read a dataset as follows:
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> table2 = pq.ParquetDataset('test').read()
> {code}
> The resulting 'year_week' column in table 2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
> <Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
> [
>   -- dictionary:
>     [
>   20192,
>   20193
>     ]
>   -- indices:
>     [
>   0
>     ],
>   -- dictionary:
>     [
>   20192,
>   20193
>     ]
>   -- indices:
>     [
>   1
>     ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2136) [Python] Non-nullable schema fields not checked in conversions from pandas

2019-06-21 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869274#comment-16869274
 ] 

Joris Van den Bossche commented on ARROW-2136:
--

You can also run into this when using {{pa.Table.from_arrays}}

{code}
In [2]: schema = pa.schema([pa.field("a", pa.float64(), nullable=False)])

In [4]: table = pa.Table.from_arrays([pa.array([1.5, None])], schema=schema) 

In [5]: table.schema
Out[5]: a: double

In [6]: table.schema.field_by_name('a')
Out[6]: pyarrow.Field

In [7]: table.column('a')
Out[7]: 

[
  [
1.5,
null
  ]
]
{code}

Under the hood, this function is doing {{Column(field, array)}}, and the Column 
constructor is assuming the field datatype matches the array's datatype.

There is a {{Column::ValidateData()}}, but looking at the implementation, that 
only checks that the chunks' types equal the field type, and does not check any 
metadata such as nullability (although the doc comment says "Verify that the 
column's array data is consistent with the passed field's metadata")

> [Python] Non-nullable schema fields not checked in conversions from pandas
> --
>
> Key: ARROW-2136
> URL: https://issues.apache.org/jira/browse/ARROW-2136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> If you provide a schema with {{nullable=False}} but pass a {{DataFrame}} 
> which in fact has nulls it appears the schema is ignored? I would expect an 
> error here.
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1.2, 2.1, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.float64(), nullable=False)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1.2,
>   2.1,
>   NA
> ]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2136) [Python] Non-nullable schema fields not checked in conversions from pandas

2019-06-21 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869281#comment-16869281
 ] 

Joris Van den Bossche commented on ARROW-2136:
--

For {{Table.from_pandas}}, in the end, the actual conversion is done column by 
column with {{pa.array}}.

So the question for me is: do we want to bake this into {{pa.array}} (e.g. add 
a `nullable=True` default keyword to {{pa.array}}), so that the underlying 
conversion raises when it cannot satisfy {{nullable=False}}?

Alternatively, we could also rather easily verify in 
{{pandas_compat.dataframe_to_arrays}} (the function that calls {{pa.array}} for 
each column) that the arrays' null_count does not conflict with the schema 
fields' nullability. Raising while converting in {{pa.array}} would of course 
surface the error earlier.
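The latter check could look roughly like this (a sketch, independent of where 
exactly it would end up):

{code:python}
import pyarrow as pa

def check_nullability(arr, field):
    # Raise if the converted array contains nulls but the target field
    # is declared as non-nullable
    if not field.nullable and arr.null_count > 0:
        raise ValueError("Field {} is not nullable but the data "
                         "contains nulls".format(field.name))

field = pa.field("a", pa.float64(), nullable=False)
check_nullability(pa.array([1.2, 2.1, None]), field)  # raises
{code}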

> [Python] Non-nullable schema fields not checked in conversions from pandas
> --
>
> Key: ARROW-2136
> URL: https://issues.apache.org/jira/browse/ARROW-2136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> If you provide a schema with {{nullable=False}} but pass a {{DataFrame}} 
> which in fact has nulls it appears the schema is ignored? I would expect an 
> error here.
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1.2, 2.1, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.float64(), nullable=False)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1.2,
>   2.1,
>   NA
> ]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5668) [Python] Display "not null" in Schema.__repr__ for non-nullable fields

2019-06-21 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-5668:


Assignee: Joris Van den Bossche

> [Python] Display "not null" in Schema.__repr__ for non-nullable fields
> --
>
> Key: ARROW-5668
> URL: https://issues.apache.org/jira/browse/ARROW-5668
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> Minor usability improvement
> {code}
> schema = pa.schema([pa.field('a', pa.int64(), nullable=False)])
> In [11]: schema   
> 
> Out[11]: a: int64
> In [12]: schema[0]
> 
> Out[12]: pyarrow.Field
> {code}
> I'd like to see
> {code}
> In [11]: schema   
> 
> Out[11]: a: int64 not null
> {code}
> or similar



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3176) [Python] Overflow in Date32 column conversion to pandas

2019-06-24 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870886#comment-16870886
 ] 

Joris Van den Bossche commented on ARROW-3176:
--

I fixed the issue on the pandas side, meaning that you no longer get an 
incorrect date (e.g. "1677-09-21 00:25:26.290448384" instead of 
"2262-04-12"), but an error: 

{code}
In [7]: pa.column('name', arr).to_pandas(date_as_object=False) 
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-04-12 00:00:00

In [8]: pa.column('name', arr).to_pandas(date_as_object=True) 
Out[8]: 
02262-04-12
Name: name, dtype: object
{code}

Since pandas only supports datetime64[ns] at the moment, I think that is the 
best we can do: you get an error for dates that are out of bounds for 
datetime64[ns], and in that case you should use the default 
{{date_as_object=True}}.

This will be fixed in pandas 0.25.0

For me, this issue can then be closed as we can rely on this fixed pandas 
behaviour.  
One alternative that could be done in pyarrow is to detect out-of-bounds dates, 
and then always return objects instead of datetime64. But since that is already 
the default behaviour to always return objects, I personally don't think we 
should "ignore" the user-specified keyword {{date_as_object=False}} in those 
cases.

> [Python] Overflow in Date32 column conversion to pandas
> ---
>
> Key: ARROW-3176
> URL: https://issues.apache.org/jira/browse/ARROW-3176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 1.0.0
>
>
> When converting an arrow column holding a {{Date32Array}} to {{pandas}} there 
> seems to be an overflow at the date {{2262-04-12}} such that the type and 
> value are wrong. The issue only occurs for columns, not for arrays.
> Running on debian 9.5 w/ python2 gives
>   
> {code}
> In [1]: import numpy as np
> In [2]: import datetime
> In [3]: import pyarrow as pa
> In [4]: pa.__version__
> Out[4]: '0.10.0'
> In [5]: arr = pa.array(np.array([datetime.date(2262, 4, 12)], 
> dtype='datetime64[D]'))
> In [6]: arr.to_pandas(date_as_object=False)
> Out[6]: array(['2262-04-12'], dtype='datetime64[D]')
> In [7]: pa.column('name', arr).to_pandas(date_as_object=False)
> Out[7]:
> 0 1677-09-21 00:25:26.290448384
> Name: name, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5655) [Python] Table.from_pydict/from_arrays not using types in specified schema correctly

2019-06-25 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5655:
-
Fix Version/s: 1.0.0

> [Python] Table.from_pydict/from_arrays not using types in specified schema 
> correctly 
> -
>
> Key: ARROW-5655
> URL: https://issues.apache.org/jira/browse/ARROW-5655
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> Example with {{from_pydict}} (from 
> https://github.com/apache/arrow/pull/4601#issuecomment-503676534):
> {code:python}
> In [15]: table = pa.Table.from_pydict(
> ...: {'a': [1, 2, 3], 'b': [3, 4, 5]},
> ...: schema=pa.schema([('a', pa.int64()), ('c', pa.int32())]))
> In [16]: table
> Out[16]: 
> pyarrow.Table
> a: int64
> c: int32
> In [17]: table.to_pandas()
> Out[17]: 
>a  c
> 0  1  3
> 1  2  0
> 2  3  4
> {code}
> Note that the specified schema has 1) different column names and 2) has a 
> non-default type (int32 vs int64) which leads to corrupted values.
> This is partly due to {{Table.from_pydict}} not using the type information in 
> the schema to convert the dictionary items to pyarrow arrays. But then it is 
> also {{Table.from_arrays}} that is not correctly casting the arrays to 
> another dtype if the schema specifies as such.
> Additional question for {{Table.from_pydict}} is whether it actually should 
> override the 'b' key from the dictionary as column 'c' as defined in the 
> schema (this behaviour depends on the order of the dictionary, which is not 
> guaranteed below python 3.6).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-04 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5811:
-
Component/s: C++

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSV as is, with all columns as strings. I don't know the 
> schema of these CSVs and they will vary as they are provided by the user.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)
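> For reference, with explicit per-column types this can already be expressed 
> roughly as follows (a sketch; it still requires knowing the column names, e.g. 
> by reading the header first):
> {code:python}
> import pyarrow as pa
> from pyarrow import csv
> 
> # Assumption: 'data.csv' has a header row with the column names
> with open('data.csv') as f:
>     names = f.readline().strip().split(',')
> 
> convert_options = csv.ConvertOptions(
>     column_types={name: pa.string() for name in names})
> table = csv.read_csv('data.csv', convert_options=convert_options)
> {code}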



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-04 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5811:
-
Summary: [C++] CSV reader: Ability to not infer column types.  (was: 
[Python] pyarrow.csv.read_csv: Ability to not infer column types.)

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSV as is, with all columns as strings. I don't know the 
> schema of these CSVs and they will vary as they are provided by the user.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns

2019-07-04 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3408:
-
Labels: csv datasets  (was: datasets)

> [C++] Add option to CSV reader to dictionary encode individual columns or all 
> string / binary columns
> -
>
> Key: ARROW-3408
> URL: https://issues.apache.org/jira/browse/ARROW-3408
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, datasets
> Fix For: 1.0.0
>
>
> For many datasets, dictionary encoding everything can result in drastically 
> lower memory usage and subsequently better performance when doing analytics.
> One difficulty of dictionary encoding in multithreaded conversions is that 
> ideally you end up with one dictionary at the end. So you have two options:
> * Implement a concurrent hashing scheme -- for low cardinality dictionaries, 
> the overhead associated with mutex contention will not be meaningful, for 
> high cardinality it can be more of a problem
> * Hash each chunk separately, then normalize at the end
> My guess is that a crude concurrent hash table with a mutex to protect 
> mutations and resizes is going to outperform the latter
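> If exposed as a reader option, usage could look roughly like this (the option 
> name below is an assumption, mirroring how it might surface in the Python 
> ConvertOptions):
> {code:python}
> from pyarrow import csv
> 
> # Hypothetical: dictionary-encode all string/binary columns while reading
> convert_options = csv.ConvertOptions(auto_dict_encode=True)
> table = csv.read_csv('data.csv', convert_options=convert_options)
> {code}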



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3378) [C++] Implement whitespace CSV tokenizer

2019-07-04 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-3378:
-
Labels: csv  (was: )

> [C++] Implement whitespace CSV tokenizer
> 
>
> Key: ARROW-3378
> URL: https://issues.apache.org/jira/browse/ARROW-3378
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5825) [Python] Exceptions swallowed in ParquetManifest._visit_directories

2019-07-04 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5825:
-
Labels: parquet  (was: Parquet)

> [Python] Exceptions swallowed in ParquetManifest._visit_directories
> ---
>
> Key: ARROW-5825
> URL: https://issues.apache.org/jira/browse/ARROW-5825
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: George Sakkis
>Priority: Major
>  Labels: parquet
>
> {{ParquetManifest._visit_directories}} uses a {{ThreadPoolExecutor}} to visit 
> partitioned parquet datasets concurrently; it waits for them to finish but 
> doesn't check whether the respective futures have failed. This is quite 
> tricky to detect and debug, as an exception is either raised later as a 
> side-effect or (perhaps worse) it passes silently.
> Observed on 0.12.1 but appears to be on latest master too.
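> A sketch of the kind of fix: calling {{result()}} on each future re-raises any 
> exception from the worker instead of dropping it:
> {code:python}
> from concurrent.futures import ThreadPoolExecutor
> 
> def visit_all(paths, visit):
>     with ThreadPoolExecutor() as executor:
>         futures = [executor.submit(visit, path) for path in paths]
>     # result() re-raises any exception raised inside the worker,
>     # instead of letting it pass silently
>     for future in futures:
>         future.result()
> {code}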



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5825) [Python] Exceptions swallowed in ParquetManifest._visit_directories

2019-07-04 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878763#comment-16878763
 ] 

Joris Van den Bossche commented on ARROW-5825:
--

[~gsakkis] do you have a reproducible example? You need a parquet dataset with 
eg an invalid file that raises an error when reading the metadata?

> [Python] Exceptions swallowed in ParquetManifest._visit_directories
> ---
>
> Key: ARROW-5825
> URL: https://issues.apache.org/jira/browse/ARROW-5825
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: George Sakkis
>Priority: Major
>  Labels: Parquet
>
> {{ParquetManifest._visit_directories}} uses a {{ThreadPoolExecutor}} to visit 
> partitioned parquet datasets concurrently; it waits for them to finish but 
> doesn't check whether the respective futures have failed. This is quite 
> tricky to detect and debug, as an exception is either raised later as a 
> side-effect or (perhaps worse) it passes silently.
> Observed on 0.12.1 but appears to be on latest master too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5817) [Python] Use pytest marks for Flight test to avoid silently skipping unit tests due to import failures

2019-07-04 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-5817:


Assignee: Joris Van den Bossche

> [Python] Use pytest marks for Flight test to avoid silently skipping unit 
> tests due to import failures
> --
>
> Key: ARROW-5817
> URL: https://issues.apache.org/jira/browse/ARROW-5817
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The approach used to determine whether or not Flight has been built will fail 
> silently if the extension is built but there is an ImportError caused by 
> linking or other issues 
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_flight.py#L35
> We should use the same "auto" approach as other optional components (see 
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/conftest.py#L40)
>  with the option for forced opt-in (so that an ImportError does not lead to 
> tests being silently skipped), so that {{--flight}} will force the tests to 
> run if we expect them to work.
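> A sketch of that "auto" pattern (simplified from what the other optional 
> components do in conftest.py):
> {code:python}
> import pytest
> 
> try:
>     from pyarrow import flight  # noqa
>     flight_available = True
> except ImportError:
>     flight_available = False
> 
> def pytest_addoption(parser):
>     parser.addoption('--flight', action='store_true',
>                      help='force running the Flight tests')
> 
> def pytest_collection_modifyitems(config, items):
>     forced = config.getoption('--flight')
>     for item in items:
>         if 'flight' in item.keywords and not flight_available and not forced:
>             item.add_marker(pytest.mark.skip(reason='Flight not available'))
> {code}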



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

