[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2020-03-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060980#comment-17060980
 ] 

Joris Van den Bossche commented on ARROW-5666:
--

With the new API, you can also specify the exact schema of the partition keys 
in case the default inference is not what you want (e.g. specify string when 
the values look like integers but you want to keep them as strings).

For example:

{code}
import pyarrow as pa
import pyarrow.dataset as ds

part = ds.partitioning(pa.schema([("year_week", pa.int64())]), flavor="hive")
dataset = ds.dataset("test", format="parquet", partitioning=part)
{code}

(With the data from the actual example this doesn't work, because "2019_2" is 
no longer parsable as an int, and it then gives all nulls instead of raising an 
error. I am going to open a separate issue about that.)

> [Python] Underscores in partition (string) values are dropped when reading 
> dataset
> --
>
> Key: ARROW-5666
> URL: https://issues.apache.org/jira/browse/ARROW-5666
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Julian de Ruiter
>Priority: Major
>  Labels: dataset-parquet-read, parquet
>
> When reading a partitioned dataset, in which the partition column contains 
> string values with underscores, pyarrow seems to be ignoring the underscores 
> in the resulting values.
> For example if I write and then read a dataset as follows:
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> table2 = pq.ParquetDataset('test').read()
> {code}
> The resulting 'year_week' column in table 2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
>  indices=int32, ordered=0>)>
> [
>   -- dictionary:
>     [
>   20192,
>   20193
>     ]
>   -- indices:
>     [
>   0
>     ],
>   -- dictionary:
>     [
>   20192,
>   20193
>     ]
>   -- indices:
>     [
>   1
>     ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2020-03-17 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060975#comment-17060975
 ] 

Joris Van den Bossche commented on ARROW-5666:
--

This now works with the new Datasets API:

{code}
In [2]: import pyarrow.dataset as ds

In [3]: dataset = ds.dataset("test/", format="parquet", partitioning="hive")

In [4]: dataset.schema
Out[4]:
value: int64
year_week: string

In [5]: dataset.to_table().to_pandas()
Out[5]:
   value year_week
0      1    2019_2
1      2    2019_3
{code}

So once we start using this new code in the parquet module (ARROW-8039), this 
issue should get resolved.



[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2019-06-25 Thread Julian de Ruiter (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872692#comment-16872692
 ] 

Julian de Ruiter commented on ARROW-5666:
-

Ideally you would want to preserve the data type of the partition columns, but 
that's going to be hard to do properly if you also want to include other types 
such as dates, etc. Maybe it would be most consistent to just read partition 
columns as a categorical type and let the user handle the details, as this 
makes the code less magical than what happens now.



[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2019-06-24 Thread Robin Kåveland (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870952#comment-16870952
 ] 

Robin Kåveland commented on ARROW-5666:
---

That's just a slight modification to the expression above:
{code:java}
all(key.isdigit() or (key.startswith('-') and key[1:].isdigit()) for key in self.keys)
{code}
But it's starting to feel like a bad idea to attempt this coercion at all. 
Maybe it's better to force the user to decide what type the partition key 
should have? If the key were read as something that looks a bit like 
{{pd.Categorical}}, it would be relatively cheap to load into memory, and the 
user could deal with it after reading the file.
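
A minimal sketch of how the signed check above could drive the type decision. The helper names {{looks_like_int}} and {{infer_signed_keys}} are hypothetical and only illustrate the expression; they are not part of pyarrow, and the keys are assumed to be the raw strings taken from the partition directory names.

```python
def looks_like_int(key):
    """True for plain digit strings, optionally prefixed with a minus sign."""
    return key.isdigit() or (key.startswith("-") and key[1:].isdigit())


def infer_signed_keys(keys):
    """Hypothetical helper: return ints only if every key looks like an int."""
    if keys and all(looks_like_int(key) for key in keys):
        return [int(key) for key in keys]
    # Anything else (underscores, a bare "-", letters) stays a string.
    return list(keys)


print(infer_signed_keys(["-10", "5"]))          # [-10, 5]
print(infer_signed_keys(["2019_2", "2019_3"]))  # ['2019_2', '2019_3']
print(infer_signed_keys(["-", "1"]))            # ['-', '1']
```

Leaving every key as a string (or a categorical) and letting the user cast afterwards would avoid the need for such a check altogether.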



[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2019-06-21 Thread Julian de Ruiter (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869345#comment-16869345
 ] 

Julian de Ruiter commented on ARROW-5666:
-

What about negative values though? isdigit returns False for '-10', for 
example, whilst int('-10') returns an int.
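
A short illustration of the mismatch, using only built-in string and int behaviour. The function name {{compare}} is made up for this example.

```python
def compare(key):
    """Return (key.isdigit(), int(key)) for a string that int() can parse."""
    return key.isdigit(), int(key)


print(compare("10"))      # (True, 10)
print(compare("-10"))     # (False, -10)
print(compare("2019_2"))  # (False, 20192)
```

So a plain isdigit check rejects the underscore case as intended, but it also rejects negative integers.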



[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2019-06-21 Thread Julian de Ruiter (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869341#comment-16869341
 ] 

Julian de Ruiter commented on ARROW-5666:
-

I think `isdigit` might be a good solution; it would also guard against 
silently casting (for example) floats to integers.



[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2019-06-20 Thread Robin Kåveland (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868924#comment-16868924
 ] 

Robin Kåveland commented on ARROW-5666:
---

Maybe it's an option to check {{all(key.isdigit() for key in self.keys)}}? 
It might also be ugly, but apparently:

{code:java}
>>> '1231_2'.isdigit()
False
>>> int('123_2')
1232
>>> '1231'.isdigit()
True
{code}

This breaks things like base-16 partition keys, but at least the original data 
is still unchanged, so users can coerce the keys themselves.
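
A rough sketch of what such a check could look like. The helper {{infer_with_isdigit}} is hypothetical (it is not the pyarrow code) and assumes the keys arrive as the raw strings from the directory names.

```python
def infer_with_isdigit(keys):
    """Hypothetical helper: convert to int only if every key is all digits."""
    if keys and all(key.isdigit() for key in keys):
        try:
            return [int(key) for key in keys]
        except ValueError:
            # isdigit() is also True for some non-ASCII digit characters
            # (e.g. superscripts) that int() rejects; keep those as strings.
            return list(keys)
    return list(keys)


print(infer_with_isdigit(["1231", "1232"]))      # [1231, 1232]
print(infer_with_isdigit(["2019_2", "2019_3"]))  # ['2019_2', '2019_3']
print(infer_with_isdigit(["ff", "10"]))          # ['ff', '10']
```

The last call shows the base-16 trade-off: such keys stay strings, and the user would convert them explicitly, for example with {{int(key, 16)}}.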



[jira] [Commented] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

2019-06-20 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868583#comment-16868583
 ] 

Joris Van den Bossche commented on ARROW-5666:
--

Thanks for the report!

The problem is that we try to convert the keys to integer, and if that fails 
just preserve them as strings. 
That is done here 
https://github.com/apache/arrow/blob/961927af56b83d0dbca91132c3f07aa06d69fc63/python/pyarrow/parquet.py#L659-L663

{code}
# Only integer and string partition types are supported right now
try:
    integer_keys = [int(x) for x in self.keys]
    dictionary = lib.array(integer_keys)
except ValueError:
    dictionary = lib.array(self.keys)
{code}

and apparently, Python will convert a string with an underscore to an integer:

{code}
In [3]: int("2019_1")
Out[3]: 20191
{code}

I think this is because recent Python versions (3.6 and later, PEP 515) allow 
underscores in integer literals (e.g. to separate thousands), and int() accepts 
the same syntax in strings. 
We could special-case this and first check whether there is an underscore in 
the string before trying to convert to integers, but that's a bit ugly.
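
A minimal sketch of that special case, assuming a hypothetical helper {{infer_partition_keys}} that receives the raw keys as strings. It mirrors the try/except logic quoted above but is not the actual pyarrow implementation, and it returns plain Python lists instead of Arrow arrays.

```python
def infer_partition_keys(keys):
    """Hypothetical helper: ints only if no key has an underscore and all parse."""
    # int() accepts underscores as digit separators, e.g. int("2019_1") == 20191,
    # so rule those keys out before attempting the conversion.
    if any("_" in key for key in keys):
        return list(keys)
    try:
        return [int(key) for key in keys]
    except ValueError:
        return list(keys)


print(int("2019_1"))                               # 20191
print(infer_partition_keys(["2019_1", "2019_2"]))  # ['2019_1', '2019_2']
print(infer_partition_keys(["2018", "2019"]))      # [2018, 2019]
```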
